URL of Video: YouTube Video

contribution: 50/50

Part1 - Introduction

1.1 - Project Background and Purpose

In this project, our objective is to illustrate the process of modeling by using R machine learning functions. We aim to understand and apply various stages of data science including data preparation, model building, and model evaluation. We have choosen the YouTube dataset which is shared by uni coordinator. We also use the same dataset in our Project 1. The specific contributions of each member will be separated 50-50 fairly.

1.2 Data Source and Characteristics

The data set analyzed can be obtained from Kaggle platform. It comes from the “Global YouTube Statistics 2023”. (https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023)

A collection of YouTube giants, this dataset offers a perfect avenue to analyze and gain valuable insights from the luminaries of the platform. With comprehensive details on top creators’ subscriber counts, video views, upload frequency, country of origin, earnings, and more, this treasure trove of information is a must-explore for aspiring content creators, data enthusiasts, and anyone intrigued by the ever-evolving online content landscape. Immerse yourself in the world of YouTube success and unlock a wealth of knowledge with this extraordinary dataset [1].

Part2 - Setup

2.1 Load required R packages

We load ‘ggplot2’ package [2] for making static graphics and ‘dplyr’ package [3] for data manipulation.

We employ scatterplot matrices and correlation heatmaps for exploratory data analysis to understand the pairwise relationships between variables [4].

We use the skimr package to obtain more comprehensive descriptive statistics of the numeric variables in the dataset [5].

library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggthemes)
library(numform)
library(treemapify)
library(timeDate)
library(lubridate)
library(reshape2)
library(ca)
library(skimr)
library(janitor)
library(flextable)
library(shiny)
library(leaflet)
library(maps)
library(RColorBrewer)
library(scales)
library(readr)
library(forecast)
library(caret)
library(knitr)
library(ROCR)
library(ROCit)
library(nortest)
library(pROC)
library(glmnet)
library(car)
library(nnet)
library(rpart)
library(rpart.plot)
library(pander)
library(e1071)
library(randomForest)
library(fpc)

2.2 Set plot theme

The theme is set up to enhance the readability and interpretability of the graphs, following the best practices in data visualization [6].

project_theme <- theme(
  panel.background = element_rect(fill = "#FFFBDC"),  # Light yellow background
  panel.grid.major = element_line(color = "#FFE4A1"), # Light orange major grid lines
  panel.grid.minor = element_blank(), # Remove minor grid lines
  plot.title = element_text(size = 18, hjust = 0.5, color = "darkblue"),  # Title color and size
  axis.title = element_text(size = 16, color = "darkblue"),  # Axis title color and size
  axis.text = element_text(size = 14, color = "black"),   # Axis text color and size
  legend.title = element_text(size = 16, color = "darkblue"), # Legend title color and size
  legend.text = element_text(size = 14, color = "black"),   # Legend text color and size
  legend.background = element_rect(fill = "#FFFBDC"),  # Legend background color
  plot.background = element_rect(fill = "#FFFBDC")   # Background color of the entire plot
)

2.3 Load the main data

In this part, we will load the main dataset, store the path in ‘Data_path’

data_path <- 'Global YouTube Statistics.csv'
raw_data <- read.csv(data_path,encoding = "UTF-8")

2.4 Dataset initial overview

Before starting in-depth analysis and modeling, it is crucial to perform an initial overview of the dataset. By using functions such as summary, str, etc., we can gain an overall understanding of the data, including data types, data distribution, missing values and possible outliers.

This is the process that we already done in Project 1, according to Professor Wei, we don’t need to do it again so we pass here. So we shorten the content about dataset overview. Mainly is using str; summary;head command to check dataset overview, we also use a function to check dataset skewness. We also shorten or even skip the EDA part in Project 2 report.

First, we use str command to analyze the data

The “Global YouTube Statistics” dataset contains 995 observations and 28 variables.

Then we would use summary command to analyze data.

we use skimr here, which is similar to summary command but it shows more descriptive statistics for numeric variables than summary command.

skimr::skim(raw_data) |>
  flextable::flextable() 

We can see from the standard output that ‘video_views_rank’ has 1 missing value, country_rank has 116, ‘subscribers_for_last_30_days’ has 337 the most in the whole dataset. The numbers of missing value of ‘Gross.tertiary.education.enrollment.rate’,‘Population’, ‘Unemployment.rate’, ‘Urban_population’, ‘Latitude’ are the same , so it should be a systematic missing issue. So we need to pay attention to this issue in the following data cleaning. According to the numeric.hist, it seems most of the numeric variables are right skewnessed.The ‘Unemployment.rate’ does not look like a normal distribution,so as ‘Longitude’.

Skewness analysis is crucial as it helps in understanding the symmetry of the data distribution [7]. We observe a higher percentage of right-skewed columns in our dataset, which implies that most numerical variables in our dataset have a distribution that has a tail on the right side, meaning the right side of the distribution is longer or fatter than the left side. This can impact the analysis and model performance because many statistical methods assume a normal distribution of the data. Right-skewed distributions are associated with higher values and outliers on the right side of the distribution which can greatly impact the mean and variance, and subsequently, distort the interpretation and performance of the model.

Addressing skewness is often important in regression models where the assumption of normally distributed residuals is made. Therefore, recognizing the presence of skewness can help us make informed decisions about data transformation techniques, such as logarithmic transformation or Box-Cox transformation, to attempt to normalize the data distribution before applying any modeling techniques. For example, this can be particularly relevant for variables like subscribers, uploads, and lowest_monthly_earnings, where skewness might impact the model’s ability to generalize well to unseen data and might lead to unreliable predictions or insights.

Calculate Right skewness numbers and left skewness numbers

calculate_skewness_percentage <- function(dataframe) {
  #initialize function variables
  right_skewed_count <- 0
  left_skewed_count <- 0
  column_names <- names(dataframe)
  #count the numbers with loops
  for (column in column_names) {
    if (is.numeric(dataframe[[column]])) {
      median_value <- median(dataframe[[column]], na.rm = TRUE)
      mean_value <- mean(dataframe[[column]], na.rm = TRUE)
      if (mean_value > median_value) {
        right_skewed_count <- right_skewed_count +1
      } else if (mean_value < median_value) {
        left_skewed_count <- left_skewed_count +1
      }
    }
  }
  #calculate percentage with the count result
  right_percentage <- round((right_skewed_count / ncol(dataframe)) * 100, 2)
  left_percentage <- round((left_skewed_count / ncol(dataframe)) * 100, 2) 
  
  return(list(right_percentage, left_percentage))  
}

# Use function to get the calculation result
results <- calculate_skewness_percentage(raw_data)
cat("\n")
cat("Percentage of right-skewed columns is:", results[[1]], "%\n")
## Percentage of right-skewed columns is: 50 %
cat("Percentage of left-skewed columns is:", results[[2]], "%\n")
## Percentage of left-skewed columns is: 21.43 %
rm(results)

The outcomes of our skewness analysis are telling and significant, revealing that 50% of the columns in our dataset are right-skewed, in contrast to a mere 21.43% being left-skewed. This preponderance of right-skewed columns is pivotal as it unveils a tendency for the distribution of most numerical variables to be tail-heavy on the right side, a characteristic that might have consequential implications for subsequent analyses and modeling efforts.

The presence of such skewed columns underlines the existence of numerous higher values and potential outliers on the right side of the distribution. These extreme values have the capability to significantly inflate the mean and variance of the data, potentially leading to distorted interpretations and undermining the reliability and robustness of our subsequent statistical models.

The substantial right skewness observed in specific variables such as subscribers, uploads, and lowest_monthly_earnings could particularly impact the predictive power and generalizability of the models. Such skewness could render the models less resilient and adaptive to unseen or new data, thereby possibly yielding unreliable predictions or insights.

Employing such transformations could help mitigate the impact of skewness and align the data more closely with the assumptions of many statistical methods, predominantly enhancing the reliability and validity of our subsequent analyses and models.

Then we use head command to keep analyzing data.

Base on the standard output, we can see that subscribers shows how many people subscribe the Youtuber. The Title is the same as Youtuber, we need to figure out why it is the same but in two different columns. the Music Youtuber has 119000000 subsribers but only 0 uploads , so we need to make data cleaning later.

Part3 - Data Cleaning and Preparation

3.1 Data Cleaning

3.1.1 Check column names and make adjustments

First, we use names() command to check each variable name.

names(raw_data)

Then, we use janitor package and rename function adjust variable names to ensure uniform formatting.

janitor::clean_names() is a function in the package whose main purpose is to convert the column names of dataframes into a clean, uniform format. Here is the main logic and rules of this function: 1.All characters to lowercase 2.Spaces are converted to underscores 3.Non-alphanumeric characters are removed 4.Numbers before text moved to the end 5.Underline after keyword in R)

raw_data <- janitor::clean_names(raw_data)

From the output result we can see most of the variable names have been adjusted properly. But some of them still need manual adjustment for better understanding and clearness.

raw_data <- raw_data %>%
  rename(
    #More descriptive names
    video_views_last_30_days = video_views_for_the_last_30_days,
    country_abbr = abbreviation,
    #Shorter names
    monthly_earnings_low = lowest_monthly_earnings,
    monthly_earning_high = highest_monthly_earnings,
    yearly_earning_low = lowest_yearly_earnings,
    yearly_earning_high = highest_yearly_earnings,
    subs_last_30_days = subscribers_for_last_30_days,
    tertiary_edu_enrollment = gross_tertiary_education_enrollment,
    )
print(colnames(raw_data))
##  [1] "rank"                     "youtuber"                
##  [3] "subscribers"              "video_views"             
##  [5] "category"                 "title"                   
##  [7] "uploads"                  "country"                 
##  [9] "country_abbr"             "channel_type"            
## [11] "video_views_rank"         "country_rank"            
## [13] "channel_type_rank"        "video_views_last_30_days"
## [15] "monthly_earnings_low"     "monthly_earning_high"    
## [17] "yearly_earning_low"       "yearly_earning_high"     
## [19] "subs_last_30_days"        "created_year"            
## [21] "created_month"            "created_date"            
## [23] "tertiary_edu_enrollment"  "population"              
## [25] "unemployment_rate"        "urban_population"        
## [27] "latitude"                 "longitude"

3.1.2 Inspect & Adjust data types

Use sapply function to check data types, adjust wrong data typs.

From the standard output first, we know that we need to handle the missing values; ‘created_month’ should change from char type to numeric type this is useful for following analysis; ‘category’,‘country’,‘channel_type’,‘abbreviation’ shoule be factor type which is better for following analysis;handle the outliers according to the skimmr result.

3.1.3 Handle Missing value

We’ve identified three representations for missing data in our dataset: “0”, “NaN”, and “nan”. Our goal is to standardize these by converting them to NA. Following this conversion, we aim to tally the number of NA values in each column. It’s our assumption that “0” serves as a sentinel value indicating missing data.

#Change 0 or 'nan' or 'NaN' to NA value
#Char columns
raw_data[raw_data == 'nan' | raw_data == "NaN" | raw_data == 0] <- NA
#Numeric columns
raw_data <- raw_data |>
  mutate(across(where(is.numeric), ~ifelse(is.nan(.), NA, .)))

Use function to calculate how many missing values in each column, handle them.

# Calculate missing value number
count_missing <- function(df) {
  sapply(df, FUN = function(col) sum(is.na(col)) )
}
# Calculate percentage of missing value
percentage_missing <- function(df) {
  sapply(df, FUN = function(col) round(sum(is.na(col)) / length(col) * 100, 2) )
}

nacounts <- count_missing(raw_data)
napercents <- percentage_missing(raw_data)

# output the result
hasNA = which(nacounts > 0)
data.frame(Column = names(nacounts[hasNA]), 
           Missing_Values = nacounts[hasNA], 
           Percentage_Missing = napercents[hasNA])
##                                            Column Missing_Values
## video_views                           video_views              8
## category                                 category             46
## uploads                                   uploads             43
## country                                   country            122
## country_abbr                         country_abbr            122
## channel_type                         channel_type             30
## video_views_rank                 video_views_rank              1
## country_rank                         country_rank            116
## channel_type_rank               channel_type_rank             33
## video_views_last_30_days video_views_last_30_days             56
## monthly_earnings_low         monthly_earnings_low            118
## monthly_earning_high         monthly_earning_high             89
## yearly_earning_low             yearly_earning_low             89
## yearly_earning_high           yearly_earning_high             79
## subs_last_30_days               subs_last_30_days            337
## created_year                         created_year              5
## created_month                       created_month              5
## created_date                         created_date              5
## tertiary_edu_enrollment   tertiary_edu_enrollment            123
## population                             population            123
## unemployment_rate               unemployment_rate            123
## urban_population                 urban_population            123
## latitude                                 latitude            123
## longitude                               longitude            123
##                          Percentage_Missing
## video_views                            0.80
## category                               4.62
## uploads                                4.32
## country                               12.26
## country_abbr                          12.26
## channel_type                           3.02
## video_views_rank                       0.10
## country_rank                          11.66
## channel_type_rank                      3.32
## video_views_last_30_days               5.63
## monthly_earnings_low                  11.86
## monthly_earning_high                   8.94
## yearly_earning_low                     8.94
## yearly_earning_high                    7.94
## subs_last_30_days                     33.87
## created_year                           0.50
## created_month                          0.50
## created_date                           0.50
## tertiary_edu_enrollment               12.36
## population                            12.36
## unemployment_rate                     12.36
## urban_population                      12.36
## latitude                              12.36
## longitude                             12.36
rm(col,hasNA,nacounts,napercents)
## Warning in rm(col, hasNA, nacounts, napercents): object 'col' not found

Our dataset examination reveals pervasive missing data across multiple columns. Given that our dataset comprises only 995 rows, a sweeping removal of missing values could severely curtail the dataset’s utility for subsequent modeling. In light of this, we’ve devised tailored strategies for addressing different variable types.

Before delving into the strategies for managing missing values, it’s crucial to outline which variables we have decided not to process and provide the rationale behind these choices:

1.Abbreviation: Although it’s feasible to replace its missing entries with ‘Unknown’, a total of 122 missing entries renders it impractical for further processing.

2.Latitude & Longitude: These columns lack intrinsic geographical data. With missing data exceeding 10% of the total, they are deemed not fit for subsequent modeling tasks.

3.Rank, Subscribers, Title & Youtuber: These columns will not be part of our processing pipeline since they manifest no missing values.

Now, let’s delve into the specifics of our chosen strategies for different columns:

1.Video Views: With a scant presence of missing values, we grappled with options between row deletion or median imputation. Given our preference to retain as many rows as possible and the data’s skewed distribution, we’ve opted for median imputation.

2.Country & Abbreviation: The magnitude of missing values (122 in total) poses challenges. Hence, rather than impute, we’ll earmark these columns for exclusion from subsequent analyses.

3.Earnings: Given its skewed distribution, our chosen strategy leans towards median imputation.

4.Latitude & Longitude: These will be excluded from further analysis due to substantial missing values and an absence of auxiliary geographical information to assist imputation.

5.Time Variables: These aren’t continuous time series, ruling out techniques like ARIMA modeling. Given the paucity of missing data (only 5 rows), which equates to a mere 0.5% of the dataset, we’ve resolved to delete rows with missing time entries, without attempting imputation.

The numeric columns slated for processing are as follows: c(“video_views”, “uploads”, “video_views_rank”, “country_rank”, “channel_type_rank”, “video_views_last_30_days”, “monthly_earnings_low”, “monthly_earning_high”, “yearly_earning_low”, “yearly_earning_high”, “subs_last_30_days”, “tertiary_edu_enrollment”, “population”, “unemployment_rate”, “urban_population”).

The categorical columns in our crosshairs are: c(“category”, “country”, “country_abbr”, “channel_type”).

Our immediate focus will be on the aforementioned numeric columns. Leaning on the outcomes of our skewness analysis, which indicated a predominantly right-skewed data distribution, we’ve favored median imputation over mean imputation. This is predicated on the fact that the median, unlike the mean, won’t perturb the right-skewed nature of the data.

In summary, our overarching strategy centers on the preservation of data integrity, and our chosen imputation methods are reflective of each column’s inherent properties and data distribution.

change_to_median_cols <- c("video_views",
"uploads","video_views_rank","country_rank","channel_type_rank","video_views_last_30_days","monthly_earnings_low",  "monthly_earning_high","yearly_earning_low",     "yearly_earning_high","subs_last_30_days","tertiary_edu_enrollment","population","unemployment_rate","urban_population")
for (col in change_to_median_cols) {
    median_val <- median(raw_data[[col]], na.rm = TRUE)
    raw_data[[col]][is.na(raw_data[[col]])] <- median_val
}
rm(col,median_val,change_to_median_cols)

Change the NAs in character variables into “Unknown”

character_columns <- c("category", "country","country_abbr","channel_type")
  
for (col in character_columns) {
  raw_data[[col]][is.na(raw_data[[col]])] <- "Unknown"
}
rm(col,character_columns)

Delete all the missing rows (5 rows) in the time column. Since there are only 5 rows, it does not affect the subsequent analysis. At the same time, delete the rows with the channel creation year less than 2005, because the YouTube company was created in 2005.

raw_data <- raw_data %>%
  filter(!is.na(created_year) & !is.na(created_month) & !is.na(created_date) & created_year >= 2005)

Overall, only 6 rows of data were deleted, which does not affect the subsequent analysis because it only accounts for 0.6% of the data.

3.1.4 Handle repeating lines

# Use unique function handle repeating lines
raw_data_unique <- unique(raw_data)
raw_data <- raw_data_unique
rm(raw_data_unique)

3.1.5 Handle outliers

3.1.5.1 classify variables

Change the variables into numerical and categorical according to the data type.

numeric_vars <- raw_data[sapply(raw_data,is.numeric)]
categorical_vars <- raw_data[sapply(raw_data,function(x) class(x) %in% c('factor','character'))]
numeric_vars_name <- names(numeric_vars)
categorical_vars_name <- names(categorical_vars)

3.1.5.2 Identify outliers

Create a function to identify outliers for all variables in the dataset

3.1.5.3 Winsorize outliers

To deal with extreme values in the dataset, we used the method of Winsorizing (shrinking the tails), which sets values greater than the 95th percentile in the dataset to the 95th percentile value. This method has unique advantages over other methods of dealing with outliers, such as truncation or deletion.

First, it preserves much of the original information of the data. Second, truncation reduces the potential adverse effects of extreme values or outliers on the model, thus improving the stability and predictive accuracy of the model.

In addition, some studies have also shown that the results of statistical hypothesis validation and model fit of the model can be effectively improved by appropriate outlier treatment [@Dixon1960].

First,identify which variables are numerical or categorical and set the variables of the category column and the variables of the numeric column

# Function to truncate values in specified columns that exceed a given percentile.
truncate_tail <- function(data, column_names, percentile = 95) {
  
  # For each column, replace values above the specified percentile with the value at that percentile.
  for (col in column_names) {
    q <- quantile(data[[col]], probs = percentile / 100, na.rm = TRUE)
    data[complete.cases(data) & data[[col]] > q, col] <- q
  }
  
  return(data)
}

# Apply the function to the first 18 columns of Youtube_cleaned.
raw_data <- truncate_tail(raw_data, numeric_vars_name, percentile = 95)

Secondly,summarize how much outliers in each variable

# Function to compute z-scores and detect outliers based on a specified threshold
detect_outliers_zscore <- function(data, column_names, threshold = 3) {
  outliers_count <- sapply(column_names, function(col_name) {
    # Calculate z-scores for each specified column
    z_scores <- (data[[col_name]] - mean(data[[col_name]], na.rm = TRUE)) / sd(data[[col_name]], na.rm = TRUE)
    # Count the number of absolute z-scores that exceed the provided threshold
    sum(abs(z_scores) > threshold, na.rm = TRUE)
  })
  return(outliers_count)
}

# Detect outliers in the data after truncation using the function above
outliers_count_after_truncate <- detect_outliers_zscore(raw_data, numeric_vars_name)
print(outliers_count_after_truncate)
##                     rank              subscribers              video_views 
##                        0                        4                        1 
##                  uploads         video_views_rank             country_rank 
##                       54                        0                       55 
##        channel_type_rank video_views_last_30_days     monthly_earnings_low 
##                       53                        2                        2 
##     monthly_earning_high       yearly_earning_low      yearly_earning_high 
##                        2                        2                        2 
##        subs_last_30_days             created_year             created_date 
##                        5                        0                        0 
##  tertiary_edu_enrollment               population        unemployment_rate 
##                        0                        0                        0 
##         urban_population                 latitude                longitude 
##                        0                       16                        0
rm(outliers_count_after_truncate,numeric_vars,categorical_vars)

Ultimately, we observe a reduction in the impact of outliers, but it’s impractical to address all of them. Overcorrecting could potentially distort the data. With the cleaning phase now complete, we’re ready to transition to the classification stage.

3.1.6 Change data type

Change data type from character to factor for ‘category’, ‘country’, ‘channel type’.

Change some categorical variable to factors to prepare for Classification.

factor_columns <- c("category", "country","country_abbr", "channel_type")
#Use mutate adjust factor_columns, change then to factor type
raw_data <- raw_data %>%
  mutate(across(all_of(factor_columns), as.factor))
rm(factor_columns)

3.1.7 Backup the processed data

Backup the raw_data into another name youtube_data

youtube_data <- raw_data

clustering_data <- raw_data

clustering_data <- raw_data 

3.1.8 Add new columns

Add new columns which might be used in the future.

3.1.8.1 Full time column

# First, ensure that the 'created_date' column is in integer format
# Assuming the data doesn't contain any decimal parts or NA values
youtube_data$created_date <- as.integer(as.character(raw_data$created_date))

# Check for any non-integer values in 'created_date'
if (any(!is.finite(youtube_data$created_date))) {
  stop("The 'created_date' column contains non-integer or NA values.")
}

# Define a vector of month names
month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Create a vector of two-digit month numbers using sprintf
month_numbers <- sprintf("%02d", 1:12)

# Create a mapping between month names and month numbers
month_mapping <- setNames(month_numbers, month_names)

# Update the 'full_date' column by combining year, month, and day
youtube_data$full_time <- as.Date(paste(raw_data$created_year, 
                                   raw_data$created_month %>% match(month_names) %>% month_mapping[.],
                                   sprintf("%02d", youtube_data$created_date), 
                                   sep = "-"), 
                             format = "%Y-%m-%d")
rm(month_mapping, month_names, month_numbers)

3.1.8.2 Subscriber engagement column

video_views/subscribers. This can give an idea of how many videos, on average, each subscriber watches, offering a measure of engagement for the channel.

youtube_data$subs_engagement <- youtube_data$video_views/youtube_data$subscribers

3.1.8.3 Add video upload frequency column

uploads/(2023-created_year). This can show how many videos a YouTuber uploads on average each year since they created their channel.

youtube_data$uploads_frequency <- youtube_data$uploads/(2023-youtube_data$created_year)

3.1.8.4 Average views per video column

video_views/uploads. This can help in understanding the average number of views per video, offering a measure of the content’s popularity.

youtube_data$average_views_per_video <- youtube_data$video_views/youtube_data$uploads

3.1.8.5 Urbanization rate column

urban_population/population. This will give a percentage of urbanization in a country.

youtube_data$urban_rate <- youtube_data$urban_population/youtube_data$population

3.1.8.6 Years of experience column

2022 - created_year. The number of years since the channel was created, which might relate to income and popularity.

youtube_data$age <- 2022-youtube_data$created_year

3.1.8.7 Education population ratio column

tertiary_edu_enrollment/population. This gives a ratio of the population that’s undergoing higher education in each country

youtube_data$edu_ratio <- youtube_data$tertiary_edu_enrollment/youtube_data$population

3.1.8.8 Top channel flag column

A binary feature based on rank (if channel rank < 1000 then 1, else 0).

youtube_data$top_channel_flag <- as.numeric(youtube_data$channel_type_rank <1000)

3.1.8.9 Video views increasing rate column

video_views_last_30_days/ video_view. This shows the increasing rate about video views.

youtube_data$video_views_increasing_rate <- (youtube_data$video_views_last_30_days/youtube_data$video_views)

3.1.8.3 Add average earning columns

Average earning for each month and each year.

youtube_data$monthly_earning <-( youtube_data$monthly_earning_high+youtube_data$monthly_earnings_low)/2
youtube_data$yearly_earning <-( youtube_data$yearly_earning_low+youtube_data$yearly_earning_high)/2

Observation found that among the overall monthly income and annual income, there are rows where the monthly income is much greater than the annual income. The reason for this problem is that some rows with the lowest monthly income have na values, which turned into average values during previous processing, resulting in this unreasonable situation.

Create a function to identify all these unreasonable rows

detect_unusual_rows <- function(data) {
  unusual_rows <- data[data$monthly_earning > data$yearly_earning, c("monthly_earning", "yearly_earning","monthly_earnings_low","monthly_earning_high","yearly_earning_low","yearly_earning_high")]
  print(unusual_rows)
  cat("The number of unreasonable rows are", nrow(unusual_rows), "rows\n")
}
detect_unusual_rows(youtube_data)
##     monthly_earning yearly_earning monthly_earnings_low monthly_earning_high
## 2          8900.025          0.310                17800                 0.05
## 15         8900.035          0.455                17800                 0.07
## 17       138975.000      97550.025                17800            260150.00
## 19         8900.030          0.410                17800                 0.06
## 28         8900.020          0.255                17800                 0.04
## 52       138975.000      97550.025                17800            260150.00
## 63         8900.010          0.100                17800                 0.02
## 76         8900.010          0.130                17800                 0.02
## 78         8900.005          0.055                17800                 0.01
## 135      138975.000      97550.025                17800            260150.00
## 173      138975.000      97550.025                17800            260150.00
## 174        8900.030          0.385                17800                 0.06
## 183        8900.005          0.055                17800                 0.01
## 200        8900.030          0.355                17800                 0.06
## 220        8900.005          0.075                17800                 0.01
## 233        8900.020          0.280                17800                 0.04
## 235        8900.005          0.055                17800                 0.01
## 251        8900.010          0.130                17800                 0.02
## 263        8900.005          0.075                17800                 0.01
## 337        8900.005          0.075                17800                 0.01
## 443        8900.015          0.180                17800                 0.03
## 449      138975.000      97550.025                17800            260150.00
## 477        8900.010          0.100                17800                 0.02
## 490        8900.010          0.100                17800                 0.02
## 542      138975.000      97550.025                17800            260150.00
## 545        8900.015          0.180                17800                 0.03
## 567      138975.000      97550.025                17800            260150.00
## 586        8900.010          0.100                17800                 0.02
## 751        8900.005          0.055                17800                 0.01
## 795        8900.015          0.180                17800                 0.03
## 803      138975.000      97550.025                17800            260150.00
## 825        8900.010          0.130                17800                 0.02
## 854        8900.005          0.075                17800                 0.01
## 858        8900.005          0.055                17800                 0.01
## 886        8900.005          0.055                17800                 0.01
## 899        8900.005          0.055                17800                 0.01
## 930      138975.000      97550.025                17800            260150.00
## 962      138975.000      97550.025                17800            260150.00
## 981        8900.005          0.055                17800                 0.01
##     yearly_earning_low yearly_earning_high
## 2                 0.04                0.58
## 15                0.05                0.86
## 17           195100.00                0.05
## 19                0.05                0.77
## 28                0.03                0.48
## 52           195100.00                0.05
## 63                0.01                0.19
## 76                0.02                0.24
## 78                0.01                0.10
## 135          195100.00                0.05
## 173          195100.00                0.05
## 174               0.05                0.72
## 183               0.01                0.10
## 200               0.04                0.67
## 220               0.01                0.14
## 233               0.03                0.53
## 235               0.01                0.10
## 251               0.02                0.24
## 263               0.01                0.14
## 337               0.01                0.14
## 443               0.02                0.34
## 449          195100.00                0.05
## 477               0.01                0.19
## 490               0.01                0.19
## 542          195100.00                0.05
## 545               0.02                0.34
## 567          195100.00                0.05
## 586               0.01                0.19
## 751               0.01                0.10
## 795               0.02                0.34
## 803          195100.00                0.05
## 825               0.02                0.24
## 854               0.01                0.14
## 858               0.01                0.10
## 886               0.01                0.10
## 899               0.01                0.10
## 930          195100.00                0.05
## 962          195100.00                0.05
## 981               0.01                0.10
## The number of unreasonable rows are 39 rows

It can be found that these unreasonable rows account for less than 5% of all rows.To ensure the data integrity of classification,these rows can be deleted. According to the discussion with Sirui Liu in Teams, it is acceptable to delete rows less than 5% in total. For now we delete 45 rows(4.6%).

youtube_data <- youtube_data[youtube_data$monthly_earning <= youtube_data$yearly_earning, ]

3.2 Data Preparation

3.2.1 Object and features

In this project, we would build model and predict the high earning Youtuber on the top list.

For now we pick some features like: “subscribers”,“video_views”,“uploads”,“channel_type”,“created_year”, “subs_engagement”,“uploads_frequency”,“average_views_per_video”,“urban_rate”,“age”,“edu_ratio”,“top_channel_flag”,“video_views_increasing_rate”

Predict variable:high/low earning

All the features and category variables have been prepared in 3.1.8 Section.

Data Integration:

We would use the World Bank GDP data from the Project2 prject description, which includes GDP per capita for each country for each year. Based on the country field merge the GDP for 2022 year. We decided to use 2022 year GDP for each country as a reference. Because our Youtuber data is for 2023, 2022 is the closest year to 2023 in the GDP data, which ensures that real-word economic fluctuations will have the least impact because the years are closest.

After manually check, we realized that the first 4 rows of the GDP csv file are not data columns, so we skipped the first 4 rows using the skip parameter. When loading, our GPD data would only load up to the 2005 GDP and could not load the rest right side columns. So we solved the problem by using ChatGPT4 to learn how to select specific columns using the readr package.

Before merge there are some Country.Names that different from the Country.Name in youtube dataset.Change them to the same countries

(diff_countries = setdiff(youtube_data$country, gdp_data$Country.Name))
## [1] "Unknown"     "Russia"      "South Korea" "Turkey"      "Venezuela"  
## [6] "Egypt"
gdp_data$Country.Name[gdp_data$Country.Name == "Egypt, Arab Rep."] <- "Egypt"
gdp_data$Country.Name[gdp_data$Country.Name == "Korea, Rep."] <- "South Korea"
gdp_data$Country.Name[gdp_data$Country.Name == "Russian Federation"] <- "Russia"
gdp_data$Country.Name[gdp_data$Country.Name == "Turkiye"] <- "Turkey"
gdp_data$Country.Name[gdp_data$Country.Name == "Venezuela, RB"] <- "Venezuela"

Change all the GDP NA into that year of average GDP value

replace_na_with_global_median <- function(data) {
  years <- colnames(data)[-1]  
  global_median_gdp <- numeric(length(years))
  
    # change GDP variable into numeric
  for (i in 2:ncol(data)) {
    data[, i] <- as.numeric(data[, i])
  }
  
  # Calculate global median GDP for each year
  for (i in 1:length(years)) {
    global_median_gdp[i] <- median(data[!is.na(data[, years[i]]), years[i]], na.rm = TRUE)
  }
  
  data[, years] <- t(apply(data[, years], 1, function(row) {
    is_na <- is.na(row)
    row[is_na] <- global_median_gdp[is_na]
    return(row)
  }))
  
  return(data)
}


gdp_data <- replace_na_with_global_median(gdp_data)

Calculate GDP value based on Youtube channel age

# Divide the per capital GDP value in those years of channel age by the channel age in these years
calculate_channel_age_adjusted_income <- function(youtube_data, gdp_data) {
  
  adjusted_income <- numeric(nrow(youtube_data))
  
  # Iterate through each row of youtube_data
  for (i in 1:nrow(youtube_data)) {
  
    country <- as.character(youtube_data$country[i])
    created_year <- as.numeric(as.character(youtube_data$created_year[i]))  
    
    # Find the corresponding country row in gdp_data
    gdp_row <- gdp_data[gdp_data$`Country.Name` == country, ]
    created_year_column <- which(names(gdp_row) == created_year)
    
    if (nrow(gdp_row) == 0) {
      adjusted_income[i] <- NA
    } else {
      gdp_values <- as.numeric(gdp_row[ , (created_year_column + 1):64])
      
      value1 <- sum(gdp_values, na.rm = TRUE)
      
      value2 <- 1 / youtube_data$age[i]
      
      adjusted_income[i] <- youtube_data$yearly_earning[i] / value2
    }
  }
  
  youtube_data$channel_age_adjusted_income <- adjusted_income
  
  return(youtube_data)
}


youtube_data <- calculate_channel_age_adjusted_income(youtube_data, gdp_data)

Handle NAs in the channel_age_adjusted_income column

median_value <- median(youtube_data$channel_age_adjusted_income, na.rm = TRUE)


for (i in 1:length(youtube_data$channel_age_adjusted_income)) {
  if (is.na(youtube_data$channel_age_adjusted_income[i])) {
    youtube_data$channel_age_adjusted_income[i] <- median_value
  }
}

Use merge command here.

gdp_2022 <- gdp_data[,c("Country.Name","gdp_per_capita")]

youtube_data <- merge(youtube_data,gdp_2022, by.x = "country",by.y = "Country.Name", all.x = TRUE)
rm(gdp_2022)

If the Country.Name is “Unknown”, then we would use the medium of youtube_data$gdp_per_capita to swap it, it won’t affect our data too much, and also, this is the same handling method that we used before in 3.1.3 Handle Missing value. Consistency is ensured here.

median_gdp <- median(youtube_data$gdp_per_capita,na.rm = TRUE)
youtube_data$gdp_per_capita[is.na(youtube_data$gdp_per_capita)] <- median_gdp
rm(median_gdp)

Normalize Youtuber earning:

We start by using Youtuber yearly earning divided by GDP per capita to observe how many time the Youtuber’s income is the country’s per capita income. Here we use annual income divided by GPD per capita, the reason for using average annual income rather than average monthly income is that anuual income is more stable and reduces fluctuations in income due to seasonality or other short term factors. Secondly, annual income is more objective when compared to annual GDP per capita.

youtube_data$normalized_earning <- as.numeric(youtube_data$yearly_earning/youtube_data$gdp_per_capita)

Logic for defining high income:

We would like to use the mean and add a certain number of standard deviations as a threshold for high income based on the standard deviation, but with this approach we have to first assume that income is normally distributed, which is not always the case in reality.

Validation Definition:

We want to verify that our data is normally distributed by visualizing a graph.

summary(youtube_data$normalized_earning)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.47   62.30  489.62  311.13 9438.44

As we can see there is a huge gap between the median and the mean. So we continue with kernel density estimates, and statistical tests, here we used the Shapiro-Wilk test; the Kolmogorov-Smirnov test; and the Anderson-Darling test

#Kernel Density Estimation
plot(density(youtube_data$normalized_earning), main="Kernel Density Estimation of Normalized Earnings", xlab="Normalized Earnings")

# Shapiro-Wilk test
shapiro_test <- shapiro.test(youtube_data$normalized_earning)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  youtube_data$normalized_earning
## W = 0.45036, p-value < 2.2e-16
# Kolmogorov-Smirnov test
ks_test <- ks.test(youtube_data$normalized_earning, "pnorm", mean(youtube_data$normalized_earning), sd(youtube_data$normalized_earning))
## Warning in ks.test.default(youtube_data$normalized_earning, "pnorm",
## mean(youtube_data$normalized_earning), : ties should not be present for the
## Kolmogorov-Smirnov test
print(ks_test)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  youtube_data$normalized_earning
## D = 0.33833, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Anderson-Darling test
ad_test <- ad.test(youtube_data$normalized_earning)
print(ad_test)
## 
##  Anderson-Darling normality test
## 
## data:  youtube_data$normalized_earning
## A = 188.25, p-value < 2.2e-16

Shapiro-Wilk normality test.

The value of W is 0.44508.The value of this statistic ranges from 0 to 1, where 1 indicates that the data perfectly fits a normal distribution. However, the resulting W value is much less than 1, which is an initial indication that the data may not be normally distributed. Moreover, the p-value is very close to 0 (less than 2.2e-16), which means that we should reject the original hypothesis of normality. Kolmogorov-Smirnov test.

The D-value is 0.34033. this is the largest gap between the Earning data distribution and the perfect normal distribution. Again, the p-value is very small (less than 2.2e-16) which further confirms that the data is not normally distributed. The test also gives a warning about the presence of duplicate values (ties) in the data. This is because we used the median instead of the Youtuber for the unknown country.

Anderson-Darling normality test.

The value of A is 191.44. the larger this value, the more the data deviates from the normal distribution. As with the previous two tests, a very small p-value (less than 2.2e-16) indicates that the data are not normally distributed.

Summary: All three tests consistently show that the youtube_data$normalized_earning data does not currently conform to a normal distribution.

Adjustment and Optimization:

While modern machine learning algorithms do not require the data to meet the assumption of a normal distribution. However, bringing the data closer to a normal distribution can improve model performance, especially for models such as linear regression and logistic regression.

We observe some very large values from the density plot, they are much larger than thousands of times the average GDP, and for these values they are simply too different. We will use a logarithmic transformation.The results of data analysis show that after Log-transformation, the data presents different distribution characteristics[9]. More conducive to classification, which is what we learned from class to reduce the effect of these too large values. After that the multiples were adjusted to between 0-10, and according to the Density image, it’s also leaning more towards a normal distribution.

youtube_data$normalized_earning <- log(youtube_data$normalized_earning + 1)  
summary(youtube_data$normalized_earning)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000007 3.018837 4.147872 4.312888 5.743422 9.152652
# Shapiro-Wilk test
shapiro_test <- shapiro.test(youtube_data$normalized_earning)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  youtube_data$normalized_earning
## W = 0.98539, p-value = 3.893e-08
# Kolmogorov-Smirnov test
ks_test <- ks.test(youtube_data$normalized_earning, "pnorm", mean(youtube_data$normalized_earning), sd(youtube_data$normalized_earning))
## Warning in ks.test.default(youtube_data$normalized_earning, "pnorm",
## mean(youtube_data$normalized_earning), : ties should not be present for the
## Kolmogorov-Smirnov test
print(ks_test)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  youtube_data$normalized_earning
## D = 0.035595, p-value = 0.18
## alternative hypothesis: two-sided
# Anderson-Darling test
ad_test <- ad.test(youtube_data$normalized_earning)
print(ad_test)
## 
##  Anderson-Darling normality test
## 
## data:  youtube_data$normalized_earning
## A = 2.0513, p-value = 3.205e-05
# Density plot
plot(density(youtube_data$normalized_earning), main="Kernel Density Estimation of Normalized Earnings", xlab="Normalized Earnings")

rm(ad_test,ks_test,shapiro_test)

Shapiro-Wilk Test.

The w-value is 0.98524, which is close to 1. This is usually a good sign that the data is close to a normal distribution. However, the p-value = 3.358e-08 (very small) is much less than 0.05 Kolmogorov-Smirnov Test.

D-value is 0.038866 which is small. p-value = 0.1134 which is greater than 0.05 Anderson-Darling Test.

A value of 2.2485, large value. p-value = 1.056e-05 (very small), much less than 0.05

youtube_data$normalized_earning is close to a normal distribution, although it is not exactly normal. Many real-world applications have data that do not strictly follow a normal distribution, but if they are close enough, many of the properties of a normal distribution are still useful.

When defining high income based on mean and standard deviation, we choose the mean plus one times the standard deviation: about 84% of the data points will be less than this value. Thus, this would be a relatively loose definition of high income that includes about 16% of YouTubers.

earning_threshold <- median(youtube_data$normalized_earning)+sd(youtube_data$normalized_earning)
print(earning_threshold)
## [1] 6.264244

It means if a Youtuber’s yearly earning is 6.2 times larger than GDP per capita then this Youtuber is a high earning Youtuber.

3.2.2 Tranform categrical variables

Because the first time we built the decision tree model, the model had too many nodes, resulting in overfitting. This is the second time to re-establish the model, so we need to modify the categorical variables and change the categorical variables into factors with fewer levels. This may enhance the generalization ability of the model.

First, Check the number of occurrences of each country

print(table(youtube_data$country))
## 
##          Afghanistan              Andorra            Argentina 
##                    1                    1                   12 
##            Australia           Bangladesh             Barbados 
##                    8                    0                    1 
##               Brazil               Canada                Chile 
##                   61                   15                    3 
##                China             Colombia                 Cuba 
##                    1                   11                    1 
##              Ecuador                Egypt          El Salvador 
##                    2                    2                    1 
##              Finland               France              Germany 
##                    1                    5                    6 
##                India            Indonesia                 Iraq 
##                  165                   28                    2 
##                Italy                Japan               Jordan 
##                    2                    5                    3 
##               Kuwait               Latvia             Malaysia 
##                    1                    1                    1 
##               Mexico              Morocco          Netherlands 
##                   33                    1                    3 
##             Pakistan                 Peru          Philippines 
##                    6                    1                   12 
##               Russia                Samoa         Saudi Arabia 
##                   16                    1                    9 
##            Singapore          South Korea                Spain 
##                    3                   17                   21 
##               Sweden          Switzerland             Thailand 
##                    4                    1                   18 
##               Turkey              Ukraine United Arab Emirates 
##                    4                    8                    7 
##       United Kingdom        United States              Unknown 
##                   42                  307                   92 
##            Venezuela              Vietnam 
##                    1                    3

Turn country variables into factors and then divide them into developed countries, developing countries, and unknown levels.

youtube_data$country_status <- factor(ifelse(youtube_data$country %in% c("Japan","Germany","Canada","United States", "United Kingdom","Denmark", "Australia","China","South Korea", "Switzerland", "Sweden", "Finland", "France", "Norway", "Singapore", "Netherlands", "Italy", "Brazil", "Belgium", "Greece", "Iceland", "Luxembourg", "Spain"), "Developed", ifelse(youtube_data$country == "Unknown", "Unknown", "Developing")))

Second, There are too many levels of channel types, which may also lead to overfitting, so the channel occurrence number is less than or equal to 20 and unknown are classified into the others category.

print(table(youtube_data$channel_type))
## 
##       Animals         Autos        Comedy     Education Entertainment 
##             3             3            50            48           297 
##          Film         Games         Howto         Music          News 
##            41            95            36           208            30 
##     Nonprofit        People        Sports          Tech       Unknown 
##             2            87            11            16            23
print(table(youtube_data$channel_type) < 20)
## 
##       Animals         Autos        Comedy     Education Entertainment 
##          TRUE          TRUE         FALSE         FALSE         FALSE 
##          Film         Games         Howto         Music          News 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##     Nonprofit        People        Sports          Tech       Unknown 
##          TRUE         FALSE          TRUE          TRUE         FALSE
youtube_data$channel_type_status <- factor(ifelse(youtube_data$channel_type %in% c("Animals","Autos","Nonprofit","Sports","Tech","Unknown"),"Others",as.character(youtube_data$channel_type)))

3.2.3 Remove unrelevant columns

According to the instructions of project 2. Columns that have a unique value for each row should be discarded, e.g., rank, Youtuber, etc.List all the columns that have a unique values for each row and then delete those columns.

youtube_data <- youtube_data[
  c(
    "normalized_earning",
    "subscribers","video_views","uploads","channel_type_status","created_year","category",
    "subs_engagement","uploads_frequency","average_views_per_video","urban_rate","age","edu_ratio","top_channel_flag","video_views_increasing_rate","channel_age_adjusted_income","country_status"
    )
  ]

3.2.4 Feature Engineering

We have already created required features in 3.1.8 Section.

“subs_engagement”,“uploads_frequency”,“average_views_per_video”,“urban_rate”,“age”,“edu_ratio”,“top_channel_flag”,“video_views_increasing_rate”

Here, we would use 6.2 times as threshold, build a new binary column, if Youtuber earning is larger than 6.2 return 1, else 0.

youtube_data$earning_class <- ifelse(youtube_data$normalized_earning > 6.2, 1, 0)
youtube_data <- subset(youtube_data, select = -normalized_earning)
names(youtube_data)
##  [1] "subscribers"                 "video_views"                
##  [3] "uploads"                     "channel_type_status"        
##  [5] "created_year"                "category"                   
##  [7] "subs_engagement"             "uploads_frequency"          
##  [9] "average_views_per_video"     "urban_rate"                 
## [11] "age"                         "edu_ratio"                  
## [13] "top_channel_flag"            "video_views_increasing_rate"
## [15] "channel_age_adjusted_income" "country_status"             
## [17] "earning_class"

3.2.5 Training and Testing dataset

According to the different income levels, we need to split the youtube_data into training data and testing data. We do a 20/80 split to form the training and test sets.

After google many articles and learn caret pacakge by ourselves, we decide to use “creatDataPartition” function in caret package, it ensures that the distribution of the target variable is taken into account when partitioning the data.[8]

# Set a seed for reproducibility
set.seed(562816)

train_index <- createDataPartition(youtube_data$earning_class, p = 0.8, list = FALSE, times = 1)

# Subset the original dataset to create training and testing datasets based on the indices generated above.
youtube_train <- youtube_data[train_index, ]
youtube_test  <- youtube_data[-train_index, ]

# Identify and store the names of all independent variables (i.e., all variables excluding the target 'earning_class').
independent_variables <- setdiff(colnames(youtube_data), "earning_class")

# From the list of independent variables, identify and store the names of numeric and integer type variables.
numeric_independent_variables <- independent_variables[sapply(youtube_data[, independent_variables], class) %in% c("numeric", "integer")]

# Similarly, from the list of independent variables, identify and store the names of factor and character type variables.
categorical_independent_variables <- independent_variables[sapply(youtube_data[, independent_variables], class) %in% c("factor", "character")]

Part4 Classification

4.1 Dependent Variable & Independent Variable

Dependent Variable: earning_class

Independent variable: “subscribers”,“video_views”,“uploads”,“channel_type_status”,“created_year”,“category”,“subs_engagement”,“uploads_frequency”,“average_views_per_video” “urban_rate”,“age”,“edu_ratio”,“top_channel_flag”,“video_views_increasing_rate”,“earning_class”,“channel_age_adjusted_income”,“country_status”

4.2 Single Variable Classification

4.2.1 Function build up

Build single variable model function. Use Single variable model for each variable to determine which has the most predictive power.

#Define target column and positive label
target_column <- 'earning_class'
positive_label <- '1'
Single_variable_model <- function(target_column, feature_column, prediction_column){
  
  # Ensure that the input vectors are of the same length
  if (!(length(target_column) == length(feature_column) && length(target_column) == length(prediction_column))) {
    stop("All input vectors must have the same length")
  }

  # Small constant to avoid division by zero
  epsilon <- 1.0e-3
  
  # Calculate the overall probability of the positive class
  probility_of_positive <- sum(target_column == positive_label) / length(target_column)
  
  # For NAs in feature_column, compute the distribution of target_column
  na_table <- table(as.factor(target_column[is.na(feature_column)]))
  
  # For NAs in feature_column, compute the conditional probability of the positive class
  probility_of_positive_with_na <- ifelse(is.na((na_table / sum(na_table))[positive_label]), probility_of_positive, (na_table / sum(na_table))[positive_label])
  
  # Create a table to represent the relationship between different values of feature_column and target_column
  value_table <- table(as.factor(target_column), feature_column, useNA = "ifany")
  
  # Compute conditional probabilities for each value of feature_column
  probility_of_positive_with_val <- (value_table[positive_label, ] + epsilon * probility_of_positive) / (colSums(value_table) + epsilon)
  
  # Generate predictions based on conditional probabilities
  predictions <- probility_of_positive_with_val[prediction_column]
  
  # Handle NA values in prediction_column
  predictions[is.na(prediction_column)] <- probility_of_positive_with_na
  
  # Handle values in prediction_column that didn't appear in the training data
  predictions[is.na(predictions)] <- probility_of_positive
  
  # Return predictions
  return(predictions)
}

Build up AUC calculator function.

AUC_calculator <- function(predcol, outcol) {
    perf <- performance(prediction(predcol, outcol == positive_label), 'auc')
    return(as.numeric(perf@y.values))
}
# Create a function to discretize numeric columns
discretizeVariable <- function(column) {
    quantiles <- quantile(column, probs = seq(0, 1, 0.1), na.rm = TRUE)
    discrete_column <- cut(column, unique(quantiles))
    return(discrete_column)
}
# single variable prediction function
SingleVariablePredictNumeric <- function(target_column, feature_column, prediction_column, subset) {
    feature_discrete <- discretizeVariable(feature_column[subset])
    prediction_discrete <- discretizeVariable(prediction_column[subset])
    return(Single_variable_model(target_column[subset], feature_discrete, prediction_discrete))
}

4.2.2 Analyze Variables

4.2.2.1 Analyzing Categorical variables:

Call Single variable model & AUC calculator function, check AUC value for categorical independent variables.

for (var in categorical_independent_variables) {
    prediction_probs <- Single_variable_model(target_column = youtube_train$earning_class, 
                                             feature_column = youtube_train[, var],
                                             prediction_column = youtube_train[, var])
    auc <- AUC_calculator(prediction_probs, youtube_train$earning_class)
    if (auc >= 0.1) {
        print(sprintf("%s - AUC: %4.3f", var, auc))
    }
}
## [1] "channel_type_status - AUC: 0.666"
## [1] "category - AUC: 0.691"
## [1] "country_status - AUC: 0.852"

4.2.2.2 Analyzing Numeric variables:

Check the AUC value for numeric independent variables. We need to change the numeric data into category data first as the instruction in Week9 Slides.

for (var in numeric_independent_variables) {
    pred_col_name <- paste('pred', var, sep = '_')
    youtube_train[, pred_col_name] <- SingleVariablePredictNumeric(youtube_train$earning_class, youtube_train[, var], youtube_train[, var])
    auc_train <- AUC_calculator(youtube_train[, pred_col_name], youtube_train$earning_class)
    if (auc_train >= 0.55) {
        print(sprintf("%s: AUC: %4.3f", var, auc_train))
    }
}
## [1] "subscribers: AUC: 0.649"
## [1] "video_views: AUC: 0.697"
## [1] "uploads: AUC: 0.711"
## [1] "created_year: AUC: 0.586"
## [1] "subs_engagement: AUC: 0.692"
## [1] "uploads_frequency: AUC: 0.749"
## [1] "average_views_per_video: AUC: 0.651"
## [1] "urban_rate: AUC: 0.899"
## [1] "age: AUC: 0.571"
## [1] "edu_ratio: AUC: 0.885"
## [1] "top_channel_flag: AUC: 0.552"
## [1] "video_views_increasing_rate: AUC: 0.788"
## [1] "channel_age_adjusted_income: AUC: 0.807"

Using 100-fold cross-validation

vars <- numeric_independent_variables 

for (var in vars) {
    aucs <- rep(0, 100)
    
    for (rep in 1:length(aucs)) {
        useForCalRep <- rbinom(n=nrow(youtube_train), size=1, prob=0.1) > 0
        
        predRep <- SingleVariablePredictNumeric(youtube_train$earning_class, 
                                                youtube_train[, var], 
                                                youtube_train[, var], 
                                                !useForCalRep)  # Pass the subset to the function
        
        # Only calculate AUC for the subset used for prediction
        actual_subset_labels = youtube_train$earning_class[!useForCalRep]
        aucs[rep] <- AUC_calculator(predRep, actual_subset_labels)
    }
    
    print(sprintf("%s: mean: %4.3f; sd: %4.3f", var, mean(aucs), sd(aucs)))
}
## [1] "subscribers: mean: 0.648; sd: 0.009"
## [1] "video_views: mean: 0.701; sd: 0.009"
## [1] "uploads: mean: 0.714; sd: 0.008"
## [1] "created_year: mean: 0.589; sd: 0.009"
## [1] "subs_engagement: mean: 0.693; sd: 0.009"
## [1] "uploads_frequency: mean: 0.745; sd: 0.008"
## [1] "average_views_per_video: mean: 0.659; sd: 0.008"
## [1] "urban_rate: mean: 0.900; sd: 0.007"
## [1] "age: mean: 0.576; sd: 0.010"
## [1] "edu_ratio: mean: 0.884; sd: 0.007"
## [1] "top_channel_flag: mean: 0.552; sd: 0.002"
## [1] "video_views_increasing_rate: mean: 0.791; sd: 0.008"
## [1] "channel_age_adjusted_income: mean: 0.811; sd: 0.006"

Urban_rate (0.899) and edu_ratio (0.884) are still the best performing predictive features. These two features perform very well in predicting the target variable.

The features video_views_increasing_rate (0.790) and channel_age_adjusted_income (0.810) are still better features with strong predictive power.

The AUC values of the two features video_views (0.700), subs_engagement (0.692) and average_views_per_video (0.659) are still relatively high, showing moderate predictive power.

The AUC value of feature uploads (0.714) increased slightly, but is still within the previously mentioned range.

The three AUC values of subscribers (0.648), created_year (0.589) and age (0.578) changed slightly, but are relatively low. They are still better than random predictions, but less predictive.

The feature top_channel_flag (0.552) has the lowest AUC value, which indicates that it may contribute less to predicting the target variable.

4.2.3 ROC plot

ROC plots are displayed in the last part of shiny

library(ROCit)

plot_roc <- function(predcol, outcol, colour_id=2, overlaid=FALSE) {
    ROCit_obj <- rocit(score = predcol, class = outcol == positive_label)
    par(new = overlaid)
    plot(ROCit_obj, col = c(colour_id, 1), legend = FALSE, YIndex = FALSE, values = FALSE)
}

double density comparing AUC

# Compare the AUC of the first two categorical variables
var1 <- categorical_independent_variables[1]
var2 <- categorical_independent_variables[2]

prediction_probs1 <- Single_variable_model(target_column = youtube_train$earning_class, 
                                          feature_column = youtube_train[, var1],
                                          prediction_column = youtube_train[, var1])

prediction_probs2 <- Single_variable_model(target_column = youtube_train$earning_class, 
                                          feature_column = youtube_train[, var2],
                                          prediction_column = youtube_train[, var2])

fig1 <- ggplot(youtube_train) + geom_density(aes(x = prediction_probs1, color = as.factor(earning_class)))
fig2 <- ggplot(youtube_train) + geom_density(aes(x = prediction_probs2, color = as.factor(earning_class)))

grid.arrange(fig1, fig2, ncol=2)

4.3 Null model

4.3.1 Build Null Model

According to the knowledge of lectures, If the AUC value of a single variable model is close to or lower than the AUC value of the null model, it means that the single variable model may not provide enough information for predicting the target variable. In this case, using this single variable model may not have a significant impact on model performance.By analyzing only the target variable, a null model is created and the AUC of the model is evaluated. The AUC value of the null model can help comparing single variable.

# Calculate the number of positive class instances (Npos)
Npos <- sum(youtube_train[, target_column] == 1)
cat("Number of positive class (target_column == 1) in youtube_train:", Npos, "rows\n")
## Number of positive class (target_column == 1) in youtube_train: 152 rows
# Calculate the number of negative class instances (Nneg)
Nneg <- nrow(youtube_train) - Npos
cat("Number of negative class (target_column == 0) in youtube_train:", Nneg, "rows\n")
## Number of negative class (target_column == 0) in youtube_train: 608 rows
# Calculate the Null Model prediction, which is the proportion of positive class
pred.Null <- Npos / (Npos + Nneg)
cat("Proportion of target_column == 1 in youtube_train (Null Model prediction):", pred.Null, "\n")
## Proportion of target_column == 1 in youtube_train (Null Model prediction): 0.2
# Create a vector of Null Model predictions with the same length as the dataset
null_model_predictions <- rep(pred.Null, nrow(youtube_train))

# Calculate the Null Model AUC using the pROC package
library(pROC) 
roc_obj <- roc(youtube_train[, target_column], null_model_predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
null_model_auc <- auc(roc_obj)

cat("Null Model AUC:", null_model_auc, "\n")
## Null Model AUC: 0.5

4.4 Feature Selection

We will use different methods to divide features into different combinations for modeling. The most basic combination is all features

feature_combine1 <- c(categorical_independent_variables,numeric_independent_variables)

4.4.1 Likelihood and deviance

First calculate the log likelihood of the null model as the standard

positive_label <- '1'
# Define a function to compute log likelihood 
logLikelihood <- function(ytrue, ypred, epsilon=1e-6) {
                          sum(ifelse(ytrue==positive_label, log(ypred+epsilon), log(1-ypred-epsilon)), na.rm=T)
}
# Compute the likelihood of the Null model on the Training model
logNull <- logLikelihood(youtube_train[,target_column], sum(youtube_train[,target_column]==positive_label)/nrow(youtube_train))
cat('NULL model\'s log likelihood is',logNull)
## NULL model's log likelihood is -380.3058

4.4.1.1 Deviance of Categorical Variables

# By adjusting the value of minDrop, we can control how strictly the variables are selected
selCatVars <- c()
minDrop <- 10 
for (v in categorical_independent_variables) {
    pi <- paste( v, sep='')
    binary_var <- ifelse(youtube_train$v == "desired_level", 1, 0)
    devDrop <- 2 * (logLikelihood(youtube_train[, target_column], binary_var) - logNull)
    if (devDrop >= minDrop) {
      print(sprintf("%s, deviance reduction: %g", pi, devDrop))
      selCatVars <- c(selCatVars, pi)
 }
}
## [1] "channel_type_status, deviance reduction: 760.612"
## [1] "category, deviance reduction: 760.612"
## [1] "country_status, deviance reduction: 760.612"

The deviance values of the two categorical variables are larger than the null model, indicating that these two features are suitable for model prediction under this method.

4.3.2.2 Deviance of Numeric Variables

selNumVars <- c()
minDrop <- 10 
for (v in numeric_independent_variables) {
  pi <- paste(v, sep='')
  epsilon <- 1e-6  # Small constant to avoid division by zero
  devDrop <- 2 * (logLikelihood(youtube_train[, target_column], youtube_train[, pi], epsilon) - logNull)
  if (devDrop >= minDrop) {
    print(sprintf("%s, deviance reduction: %g", pi, devDrop))
    selNumVars <- c(selNumVars, pi)
 }
}
## [1] "subscribers, deviance reduction: 5907.24"
## [1] "video_views, deviance reduction: 7797.81"
## [1] "uploads, deviance reduction: 3222.69"
## [1] "created_year, deviance reduction: 3073.26"
## [1] "subs_engagement, deviance reduction: 2647.69"
## [1] "uploads_frequency, deviance reduction: 2527.6"
## [1] "average_views_per_video, deviance reduction: 5335.74"
## [1] "age, deviance reduction: 1386.69"
## [1] "top_channel_flag, deviance reduction: 760.612"
## [1] "channel_age_adjusted_income, deviance reduction: 6086.11"

Set the value of minDrop to 0 and traverse all numerical variables. There are originally 12 numerical variables, but only 9 have results. This means that in this selection method, only the deviance values of these 9 variables are better than the null model. Large, we can choose to do model predictions. “deviance reduction” indicates the contribution of each feature in the model. The higher the contribution, the stronger the feature’s ability to explain the model. Therefore, variables with high contribution are selected as feature combination 2.In feature combination 2, the factor variables only use channel_type_status and country_status since they have less layers which are benefit for modeling

feature_combine2 <- c("channel_type_status","country_status","video_views","average_views_per_video","subscribers","channel_age_adjusted_income")

4.4.2 Chi-Square Test

analyze_features <- function(data, target_column, categorical_features, numeric_features) {
  results <- data.frame(Feature = character(0), Analysis = character(0), P_Value = numeric(0))
  
  # Analyze categorical features
  for (feature in categorical_features) {
    cross_table <- table(data[, feature], data[, target_column])
    chi_square <- chisq.test(cross_table)
    
    result <- data.frame(
      Feature = feature,
      Analysis = "Chi-Square Test",
      P_Value = chi_square$p.value
    )
    
    results <- rbind(results, result)
  }
  
  # Analyze numerical features
  for (feature in numeric_features) {
    anova_result <- aov(data[, target_column] ~ data[, feature])
    
    result <- data.frame(
      Feature = feature,
      Analysis = "ANOVA",
      P_Value = summary(anova_result)[[1]][["Pr(>F)"]]
    )
    
    results <- rbind(results, result)
  }
  
  return(results)
}

results <- analyze_features(youtube_data, target_column, categorical_independent_variables, numeric_independent_variables)

print(results)
##                        Feature        Analysis       P_Value
## 1          channel_type_status Chi-Square Test  8.507498e-14
## 2                     category Chi-Square Test  1.741224e-13
## 3               country_status Chi-Square Test  7.441636e-67
## 4                  subscribers           ANOVA  1.715400e-06
## 5                  subscribers           ANOVA            NA
## 6                  video_views           ANOVA  8.303273e-15
## 7                  video_views           ANOVA            NA
## 8                      uploads           ANOVA  9.682326e-30
## 9                      uploads           ANOVA            NA
## 10                created_year           ANOVA  1.006958e-01
## 11                created_year           ANOVA            NA
## 12             subs_engagement           ANOVA  8.612474e-10
## 13             subs_engagement           ANOVA            NA
## 14           uploads_frequency           ANOVA  7.003394e-30
## 15           uploads_frequency           ANOVA            NA
## 16     average_views_per_video           ANOVA  3.617295e-02
## 17     average_views_per_video           ANOVA            NA
## 18                  urban_rate           ANOVA 2.307276e-106
## 19                  urban_rate           ANOVA            NA
## 20                         age           ANOVA  1.006958e-01
## 21                         age           ANOVA            NA
## 22                   edu_ratio           ANOVA  3.674466e-01
## 23                   edu_ratio           ANOVA            NA
## 24            top_channel_flag           ANOVA  3.627094e-06
## 25            top_channel_flag           ANOVA            NA
## 26 video_views_increasing_rate           ANOVA  6.218954e-01
## 27 video_views_increasing_rate           ANOVA            NA
## 28 channel_age_adjusted_income           ANOVA  4.446655e-38
## 29 channel_age_adjusted_income           ANOVA            NA

According to the result analysis, the smaller the p value, the stronger the correlation between the feature and the independent variable. When the p value is less than 0.05, which is 5e-2, it indicates that the feature has a strong correlation with the independent variable. Filter out strongly correlated features as feature combination 3

feature_combine3 <- c("category","subscribers","created_year","average_views_per_video","urban_rate","age","edu_ratio","top_channel_flag","channel_age_adjusted_income")

4.5 Multi-Variable Classification

4.5.1 Build the model performacne measure function and plotting ROC function

4.5.1.1 Calculate the Model’s accuracy, precision, recall, and f1 score

Use different evaluation methods to evaluate the representation of the decision tree model

# ytrue should be a vector containing 1s (or TRUE) and 0s (or FALSE);
# ypred should be a vector containing the predicted probability values for the target class.
# Both ytrue and ypred should have the same length.
performanceMeasures <- function(ytrue, ypred, model.name = "model", threshold=0.5) {
      # compute the normalised deviance
      dev.norm <- -2 * logLikelihood(ytrue, ypred)/length(ypred)
      # compute the confusion matrix
      confmatrix <- table(actual = ytrue, predicted = ypred >= threshold)
      accuracy <- sum(diag(confmatrix)) / sum(confmatrix)
      precision <- confmatrix[2, 2] / sum(confmatrix[, 2])
      recall <- confmatrix[2, 2] / sum(confmatrix[2, ])
      f1 <- 2 * precision * recall / (precision + recall)
      data.frame(model = model.name, precision = precision, recall = recall, f1 = f1, dev.norm = dev.norm)
}

# pander formating
panderOpt <- function(){
    library(pander)
    # setting up Pander Options
    panderOptions("plain.ascii", TRUE)
    panderOptions("keep.trailing.zeros", TRUE)
    panderOptions("table.style", "simple")
}

# Prettier Performance Table Function
pretty_perf_table <- function(model, xtrain, ytrain,
xtest, ytest, threshold=0.5) {
    # Option setting for Pander
    panderOpt()
    perf_justify <- "lrrrr"
    
    # call the predict() function to do the predictions
    pred_train <- predict(model, newdata=xtrain)
    pred_test <- predict(model, newdata=xtest)
    
    # comparing performance on training vs. test
    trainperf_df <- performanceMeasures(
    ytrain, pred_train, model.name=paste(substitute(model), ", training"), threshold=threshold)
    testperf_df <- performanceMeasures(
    ytest, pred_test, model.name=paste(substitute(model), ", testing"), threshold=threshold)
    
    # combine the two performance data frames using rbind()
    perftable <- rbind(trainperf_df, testperf_df)
    pandoc.table(perftable, justify = perf_justify)
}
# calculate each models' LogLikelihood value
calculateLogLikelihood <- function(model, x, y, epsilon = 1e-6) {

  predicted_probabilities <- predict(model, newdata = x)
  
  log_likelihood <- sum(ifelse(y == 1, log(predicted_probabilities + epsilon), log(1 - predicted_probabilities + epsilon)))
  
  return(log_likelihood)
}

4.5.1.2 Plot the ROC

Build a function to generate the ROC plot

library(ROCit)
plot_roc2 <- function(predcol1, outcol1, predcol2, outcol2){
    roc_1 <- rocit(score=predcol1, class=outcol1==positive_label)
    roc_2 <- rocit(score=predcol2, class=outcol2==positive_label)
    plot(roc_1, col = c("blue","green"), lwd = 3,
    legend = FALSE,YIndex = FALSE, values = TRUE, asp=1)
    lines(roc_2$TPR ~ roc_2$FPR, lwd = 3,
    col = c("red","green"), asp=1)
    legend("bottomright", col = c("blue","red", "green"),
    c("Test Data", "Training Data", "Null Model"), lwd = 2)
}
plot_roc3 <- function(predcol, outcol) {
    roc <- rocit(score = predcol, class = outcol == positive_label)
    plot(roc, col = c("blue", "green"), lwd = 3,
         legend = FALSE, YIndex = FALSE, values = TRUE, asp = 1)
    legend("bottomright", col = c("blue", "green"),
           c("Test Data", "Training Data"), lwd = 2)
}

4.5.2 Descision Tree Classifier

Interpret decision tree model:

Considering the wide readership, we will explain how the decision tree classifier works. A decision tree classifier is an intelligent algorithm that is like a map of answers to a question. It uses a series of questions to determine which category something belongs to. Imagine a tree where the root is the first question and each leaf node represents an answer and the path is determined by answering the question. For example, if we want to differentiate between fruits, the first question might be “Is this fruit round?” If so, go down one branch and ask the next question, and if not, go down another branch. Through a series of questions and answers, the type of fruit was finally determined. This tree structure is very intuitive, making decision tree classifiers easy to understand and use, and suitable for various applications such as disease diagnosis, customer classification, etc.

4.5.2.1 Build Descision Tree Model for all features

First build a model based on all features and observe the performance of the model.

formula_Var1 <- paste(target_column,'> 0 ~ ',paste(c(categorical_independent_variables,numeric_independent_variables), collapse=' + '), sep='')
tmodel1 <- rpart(formula_Var1, data=youtube_train)

# visualization decision tree 
rpart.plot(tmodel1)

Check the Model Performance

# train set AUC score
tmodel1_train_auc <- AUC_calculator(predict(tmodel1, newdata=youtube_train), youtube_train[,target_column])
cat('tmodel1\'s AUC score on the training set is',tmodel1_train_auc, "\n")
## tmodel1's AUC score on the training set is 0.9926744
# test set AUC score
tmodel1_test_auc <- AUC_calculator(predict(tmodel1, newdata=youtube_test), youtube_test[,target_column])
cat('tmodel1\'s AUC score on the testing set is',tmodel1_test_auc)
## tmodel1's AUC score on the testing set is 0.9734163
# performance table
pretty_perf_table(tmodel1,youtube_train[c(categorical_independent_variables,numeric_independent_variables)],youtube_train[,target_column] == positive_label,youtube_test[c(categorical_independent_variables,numeric_independent_variables)],youtube_test[,target_column] == positive_label)
## 
## 
## model                  precision   recall       f1   dev.norm
## -------------------- ----------- -------- -------- ----------
## tmodel1 , training        0.9036   0.9868   0.9434     1.1056
## tmodel1 , testing         0.8684   0.9706   0.9167     0.9334
# log likelihood
cat("tmodel1 Log Likelihood:", calculateLogLikelihood(tmodel1, youtube_test, positive_label, epsilon = 1e-6), "\n")
## tmodel1 Log Likelihood: -0.02955777
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
#Print the ROC plot
pred_test_roc_tmodel1 <- predict(tmodel1, newdata=youtube_test)
pred_train_roc_tmodel1 <- predict(tmodel1, newdata=youtube_train)
plot_roc2(pred_test_roc_tmodel1, youtube_test[[target_column]],
         pred_train_roc_tmodel1, youtube_train[[target_column]])

The ‘Decision Tree with all features’ model assigns a log likelihood of -0.02955777 which is better than the Null model’s -380.3058. Although the model performs well in terms of AUC and Recall, the Precision is relatively low, indicating that there is a certain number of false positives in cases where the model predicts a positive class. This may cause the model to generate more false positives in practical applications, reducing its reliability and operability. Moreover, the AUC value is too large, and there is a risk of overfitting.

4.5.2.2 Build Descision Tree Model for feature combination 2

Build the model for feature combination 2

formula_Var2 <- paste(target_column,'> 0 ~ ', paste(feature_combine2, collapse=' + '), sep='')
tmodel2 <- rpart(formula_Var2, data=youtube_train)

# visualization decision tree 
rpart.plot(tmodel2)

Check the Model Performance

# train set AUC score
tmodel2_train_auc <- AUC_calculator(predict(tmodel2, newdata=youtube_train), youtube_train[,target_column])
cat('tmodel2\'s AUC score on the training set is',tmodel2_train_auc, "\n")
## tmodel2's AUC score on the training set is 0.9397236
# test set AUC score
tmodel2_test_auc <- AUC_calculator(predict(tmodel2, newdata=youtube_test), youtube_test[,target_column])
cat('tmodel2\'s AUC score on the testing set is',tmodel2_test_auc)
## tmodel2's AUC score on the testing set is 0.9523002
# performance table
pretty_perf_table(tmodel2,youtube_train[feature_combine2],youtube_train[,target_column] == positive_label,youtube_test[feature_combine2],youtube_test[,target_column] == positive_label)
## 
## 
## model                  precision   recall       f1   dev.norm
## -------------------- ----------- -------- -------- ----------
## tmodel2 , training        0.8696   0.7895   0.8276      1.154
## tmodel2 , testing         0.7179   0.8235   0.7671      1.233
# log likelihood
cat("tmodel2 Log Likelihood:", calculateLogLikelihood(tmodel2, youtube_test, positive_label, epsilon = 1e-6), "\n")
## tmodel2 Log Likelihood: -2.079434
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# Print the ROC plot
pred_test_roc_tmodel2 <- predict(tmodel2, newdata=youtube_test)
pred_train_roc_tmodel2 <- predict(tmodel2, newdata=youtube_train)
plot_roc2(pred_test_roc_tmodel2, youtube_test[[target_column]],
         pred_train_roc_tmodel2, youtube_train[[target_column]])

The ‘Decision Tree with feature combination 2’ model assigns a log likelihood of -2.079434 which is better than the Null model’s -380.3058. The model has an AUC of 0.94, indicating strong discriminative power. However, it exhibits suboptimal performance in terms of Precision (0.8696) and Recall (0.7895). The Precision value suggests that the model tends to produce false positives, potentially leading to a higher rate of misclassifications. Although Recall shows some improvement compared to Precision, the F1 score of 0.8276 suggests an imbalance between Precision and Recall. The model’s Deviance Normalization value of 1.154 indicates that it could be further fine-tuned for improved calibration. In summary, while the model has good discriminatory ability (AUC), it suffers from imbalanced Precision and Recall, requiring optimization to achieve a more balanced trade-off between false positives and false negatives. -380.3058.

4.5.2.3 Build Descision Tree Model for feature combination 3

Build the model

formula_Var3 <- paste(target_column,'> 0 ~ ', paste(feature_combine3, collapse=' + '), sep='')
tmodel3 <- rpart(formula_Var3, data=youtube_train)

# visualization decision tree
rpart.plot(tmodel3)

Check the Model Performance

# train set AUC score 
tmodel3_train_auc <- AUC_calculator(predict(tmodel3, newdata=youtube_train), youtube_train[,target_column])
cat('tmodel3\'s AUC score on the training set is',tmodel3_train_auc, "\n")
## tmodel3's AUC score on the training set is 0.971915
# test set AUC score
tmodel3_test_auc <- AUC_calculator(predict(tmodel3, newdata=youtube_test), youtube_test[,target_column])
cat('tmodel3\'s AUC score on the testing set is',tmodel3_test_auc)
## tmodel3's AUC score on the testing set is 0.950132
# performance table
pretty_perf_table(tmodel3,youtube_train[feature_combine3],youtube_train[,target_column] == positive_label,youtube_test[feature_combine3],youtube_test[,target_column] == positive_label)
## 
## 
## model                  precision   recall       f1   dev.norm
## -------------------- ----------- -------- -------- ----------
## tmodel3 , training        0.9281   0.9342   0.9311      1.189
## tmodel3 , testing         0.8421   0.9412   0.8889      1.198
# log likelihood
cat("tmodel3 Log Likelihood:", calculateLogLikelihood(tmodel3, youtube_test, positive_label, epsilon = 1e-6), "\n")
## tmodel3 Log Likelihood: -0.02955777
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# Print the ROC plot
pred_test_roc_tmodel3 <- predict(tmodel3, newdata=youtube_test)
pred_train_roc_tmodel3 <- predict(tmodel3, newdata=youtube_train)
plot_roc2(pred_test_roc_tmodel3, youtube_test[[target_column]],
         pred_train_roc_tmodel3, youtube_train[[target_column]])

The ‘Decision Tree with feature combination 3’ model assigns a log likelihood of -2.079434 which is better than the Null model’s -380.3058. With an impressive AUC value of 0.9719, the ‘tmodel3’ model showcases remarkable discriminatory power. Its strong Precision (0.9281) suggests that it minimizes false positives, while its high Recall (0.9342) indicates a balanced ability to capture true positives. This equilibrium is reflected in the F1 score of 0.9311, denoting a harmonious trade-off between Precision and Recall. The Deviance Normalization value of 1.189 implies some room for calibration enhancement, but overall, this model demonstrates exceptional performance in classifying the target variable.

4.5.3 Logistic Regression classifier

The decision to employ logistic regression as the classification model is driven by its suitability for predictive tasks. Logistic regression is a linear model with interpretable coefficients that reveal feature impacts on the target variable. It automatically selects pertinent features, while unimportant ones can be minimized with regularization [10]. In our binary classification task (1 for high income, 0 for low income), binary logistic regression is an ideal choice, offering both interpretability and relevance to the problem. It aids in understanding the significance of features, simplifies the model, and effectively predicts income levels.

4.5.3.1 Build Logistic Regression model for all features

lrmodel1 <- glm(as.formula(paste("earning_class ~ ", paste(feature_combine1, collapse = ' + '))), data = youtube_train, family = "binomial")

Check the Model Performance

# train set AUC score 
lrmodel1_train_auc <- AUC_calculator(predict(lrmodel1, newdata=youtube_train), youtube_train[,target_column])
cat('lrmodel1\'s AUC score on the training set is',lrmodel1_train_auc, "\n")
## lrmodel1's AUC score on the training set is 0.9781315
# test set AUC score
lrmodel1_test_auc <- AUC_calculator(predict(lrmodel1, newdata=youtube_test), youtube_test[,target_column])
cat('lrmodel1\'s AUC score on the testing set is',lrmodel1_test_auc)
## lrmodel1's AUC score on the testing set is 0.9485294
# performance table
pretty_perf_table(lrmodel1,youtube_train[feature_combine1],youtube_train[,target_column] == positive_label,youtube_test[feature_combine1],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## lrmodel1 , training        0.9242   0.8026   0.8592     -3.304
## lrmodel1 , testing         0.8438   0.7941   0.8182     -3.195
# log likelihood
cat("lrmodel1 Log Likelihood:", calculateLogLikelihood(lrmodel1, youtube_test, positive_label, epsilon = 1e-6), "\n")
## lrmodel1 Log Likelihood: NaN
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_lrmodel1 <- predict(lrmodel1, newdata=youtube_test)
pred_train_roc_lrmodel1 <- predict(lrmodel1, newdata=youtube_train)
plot_roc2(pred_test_roc_lrmodel1, youtube_test[[target_column]],
         pred_train_roc_lrmodel1, youtube_train[[target_column]])

The ‘Logistic Regression with all features’ model has a high AUC value, indicating that the model has a strong ability to distinguish between positive and negative categories. The accuracy is 0.924, which is high, indicating that the model can accurately predict the positive category, but There is also the possibility of false positives, and the log-likelihood value of the model has no calculated results, so the prediction results of the model are not very good.

4.5.3.2 Build Logistic Regression model for feature combination 2

lrmodel2 <- glm(as.formula(paste("earning_class ~ ", paste(feature_combine2, collapse = ' + '))), data = youtube_train, family = "binomial")

Check the Model Performance

# train set AUC score 
lrmodel2_train_auc <- AUC_calculator(predict(lrmodel2, newdata=youtube_train), youtube_train[,target_column])
cat('lrmodel2\'s AUC score on the training set is',lrmodel2_train_auc, "\n")
## lrmodel2's AUC score on the training set is 0.9497706
# test set AUC score
lrmodel2_test_auc <- AUC_calculator(predict(lrmodel2, newdata=youtube_test), youtube_test[,target_column])
cat('lrmodel2\'s AUC score on the testing set is',lrmodel2_test_auc)
## lrmodel2's AUC score on the testing set is 0.9457014
# performance table
pretty_perf_table(lrmodel2,youtube_train[feature_combine2],youtube_train[,target_column] == positive_label,youtube_test[feature_combine2],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## lrmodel2 , training        0.8911   0.5921   0.7115     -2.509
## lrmodel2 , testing         0.7500   0.6176   0.6774     -2.174
# log likelihood
cat("lrmodel2 Log Likelihood:", calculateLogLikelihood(lrmodel2, youtube_test, positive_label, epsilon = 1e-6), "\n")
## lrmodel2 Log Likelihood: NaN
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_lrmodel2 <- predict(lrmodel2, newdata=youtube_test)
pred_train_roc_lrmodel2 <- predict(lrmodel2, newdata=youtube_train)
plot_roc2(pred_test_roc_lrmodel2, youtube_test[[target_column]],
         pred_train_roc_lrmodel2, youtube_train[[target_column]])

Although the AUC value of The ‘Logistic Regression with feature combination 2’ model is not bad at 0.949, the precision is a bit low at only 0.89, which means that false positives are more likely to occur. The recall rate is also a bit low at only 0.59, which means that the model cannot capture it correctly. Most of the positive categories, and the log-likelihood value cannot be calculated, so the model cannot predict well.

4.5.3.3 Build Logistic Regression model for feature combination 3

lrmodel3 <- glm(as.formula(paste("earning_class ~ ", paste(feature_combine3, collapse = ' + '))), data = youtube_train, family = "binomial")

Check the Model Performance

# train set AUC score 
lrmodel3_train_auc <- AUC_calculator(predict(lrmodel3, newdata=youtube_train), youtube_train[,target_column])
cat('lrmodel3\'s AUC score on the training set is',lrmodel3_train_auc, "\n")
## lrmodel3's AUC score on the training set is 0.9617599
# test set AUC score
lrmodel3_test_auc <- AUC_calculator(predict(lrmodel3, newdata=youtube_test), youtube_test[,target_column])
cat('lrmodel3\'s AUC score on the testing set is',lrmodel3_test_auc)
## lrmodel3's AUC score on the testing set is 0.9144042
# performance table
pretty_perf_table(lrmodel3,youtube_train[feature_combine3],youtube_train[,target_column] == positive_label,youtube_test[feature_combine3],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## lrmodel3 , training        0.8780   0.7105   0.7855     -2.850
## lrmodel3 , testing         0.7742   0.7059   0.7385     -2.873
# log likelihood
cat("lrmodel3 Log Likelihood:", calculateLogLikelihood(lrmodel3, youtube_test, positive_label, epsilon = 1e-6), "\n")
## lrmodel3 Log Likelihood: NaN
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_lrmodel3 <- predict(lrmodel3, newdata=youtube_test)
pred_train_roc_lrmodel3 <- predict(lrmodel3, newdata=youtube_train)
plot_roc2(pred_test_roc_lrmodel3, youtube_test[[target_column]],
         pred_train_roc_lrmodel3, youtube_train[[target_column]])

While the ‘Logistic Regression with all features’ model demonstrates a strong AUC value of 0.961, its precision is relatively lower at 0.897. This lower precision suggests a heightened likelihood of false positive predictions. Furthermore, the model exhibits a moderate recall rate of 0.71, which means it captures some positive cases correctly but may miss a substantial portion. It’s important to note that the model’s log-likelihood value cannot be calculated, indicating potential limitations in its predictive capabilities. Overall, while the AUC score is promising, the model’s imbalanced precision and recall rates suggest room for improvement in its predictive performance.

4.5.4 SVM classifier

Because the SVM classifier can only be used to analyze numerical features, only numerical variables are used as model features in this classifier method[11].

library(e1071)
svm_model <- svm(as.formula(paste("earning_class ~ ", paste(numeric_independent_variables, collapse = ' + '))), data = youtube_train, kernel = "radial")

Check the Model Performance

# train set AUC score 
svm_model_train_auc <- AUC_calculator(predict(svm_model, newdata=youtube_train), youtube_train[,target_column])
cat('svm_model\'s AUC score on the training set is',svm_model_train_auc, "\n")
## svm_model's AUC score on the training set is 0.9331285
# test set AUC score
svm_model_test_auc <- AUC_calculator(predict(svm_model, newdata=youtube_test), youtube_test[,target_column])
cat('svm_model\'s AUC score on the testing set is',svm_model_test_auc)
## svm_model's AUC score on the testing set is 0.9040347
# performance table
pretty_perf_table(svm_model,youtube_train[numeric_independent_variables],youtube_train[,target_column] == positive_label,youtube_test[numeric_independent_variables],youtube_test[,target_column] == positive_label)
## 
## 
## model                    precision   recall       f1   dev.norm
## ---------------------- ----------- -------- -------- ----------
## svm_model , training        0.9244   0.7237   0.8118     0.6529
## svm_model , testing         0.8214   0.6765   0.7419     0.5674
# log likelihood
cat("svm_model Log Likelihood:", calculateLogLikelihood(svm_model, youtube_test, positive_label, epsilon = 1e-6), "\n")
## svm_model Log Likelihood: -1.312188
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_svm_model <- predict(svm_model, newdata=youtube_test)
pred_train_roc_svm_model <- predict(svm_model, newdata=youtube_train)
plot_roc2(pred_test_roc_svm_model, youtube_test[[target_column]],
         pred_train_roc_svm_model, youtube_train[[target_column]])

The ‘SVM for all numeric features’ model displays a good ability to differentiate between classes with an AUC of 0.9331. While achieving a high precision of 0.9244, suggesting limited false positives, it has a moderate recall of 0.7237, indicating some potential for missing positive cases. The Log Likelihood value of -1.3122 is encouraging, outperforming the null model, though further refinement is needed. In summary, the model shows promise but requires optimization to balance precision and recall effectively and maximize its potential.

4.5.5 Random Forests classifier

4.5.5.1 Build Random Forests model for feature combination 1

Build Random Forests model for feature combination 1

rfmodel1 <- randomForest(as.formula(paste("earning_class ~ ", paste(feature_combine1, collapse = ' + '))), data = youtube_train, ntree = 100)

Check the Model Performance

# train set AUC score 
rfmodel1_train_auc <- AUC_calculator(predict(rfmodel1, newdata=youtube_train), youtube_train[,target_column])
cat('rfmodel1\'s AUC score on the training set is',rfmodel1_train_auc, "\n")
## rfmodel1's AUC score on the training set is 1
# test set AUC score
rfmodel1_test_auc <- AUC_calculator(predict(rfmodel1, newdata=youtube_test), youtube_test[,target_column])
cat('rfmodel1\'s AUC score on the testing set is',rfmodel1_test_auc)
## rfmodel1's AUC score on the testing set is 0.9926471
# performance table
pretty_perf_table(rfmodel1,youtube_train[feature_combine1],youtube_train[,target_column] == positive_label,youtube_test[feature_combine1],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## rfmodel1 , training        1.0000   1.0000   1.0000     0.9558
## rfmodel1 , testing         0.9091   0.8824   0.8955     0.8115
# log likelihood
cat("rfmodel1 Log Likelihood:", calculateLogLikelihood(rfmodel1, youtube_test, positive_label, epsilon = 1e-6), "\n")
## rfmodel1 Log Likelihood: -0.6746506
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_rfmodel1 <- predict(rfmodel1, newdata=youtube_test)
pred_train_roc_rfmodel1 <- predict(rfmodel1, newdata=youtube_train)
plot_roc2(pred_test_roc_rfmodel1, youtube_test[[target_column]],
         pred_train_roc_rfmodel1, youtube_train[[target_column]])

While the AUC of the ‘Random Forest for all features’ model stands at a perfect 1.0, indicating an exceptional ability to distinguish between classes, the precision of 1.0 suggests that it might be overfitting to the training data. Although the recall is quite high at 0.9934, meaning it effectively captures the positive cases, a Log Likelihood value of -0.7402 raises concerns. This Log Likelihood suggests that the model performs worse than the null model. This paradoxical result is indicative of an issue with the model’s generalization. In summary, despite the seemingly perfect AUC and high recall, the Random Forest model demonstrates a suboptimal performance as indicated by the Log Likelihood and precision values.

4.5.5.2 Build Random Forests model for feature combination 2

Build Random Forests model for feature combination 2

rfmodel2 <- randomForest(as.formula(paste("earning_class ~ ", paste(feature_combine2, collapse = ' + '))), data = youtube_train, ntree = 100)

Check the Model Performance

# train set AUC score 
rfmodel2_train_auc <- AUC_calculator(predict(rfmodel2, newdata=youtube_train), youtube_train[,target_column])
cat('rfmodel2\'s AUC score on the training set is',rfmodel2_train_auc, "\n")
## rfmodel2's AUC score on the training set is 0.9999567
# test set AUC score
rfmodel2_test_auc <- AUC_calculator(predict(rfmodel2, newdata=youtube_test), youtube_test[,target_column])
cat('rfmodel2\'s AUC score on the testing set is',rfmodel2_test_auc)
## rfmodel2's AUC score on the testing set is 0.9601244
# performance table
pretty_perf_table(rfmodel2,youtube_train[feature_combine2],youtube_train[,target_column] == positive_label,youtube_test[feature_combine2],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## rfmodel2 , training        1.0000   0.9737   0.9867     1.0171
## rfmodel2 , testing         0.6667   0.8235   0.7368     0.9166
# log likelihood
cat("rfmodel2 Log Likelihood:", calculateLogLikelihood(rfmodel2, youtube_test, positive_label, epsilon = 1e-6), "\n")
## rfmodel2 Log Likelihood: -0.8320239
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_rfmodel2 <- predict(rfmodel2, newdata=youtube_test)
pred_train_roc_rfmodel2 <- predict(rfmodel2, newdata=youtube_train)
plot_roc2(pred_test_roc_rfmodel2, youtube_test[[target_column]],
         pred_train_roc_rfmodel2, youtube_train[[target_column]])

Despite the ‘Random Forest for feature combination 2’ model achieving an exceptional AUC score of 0.9999459 on the training data, reflecting strong discriminative capability, the model exhibits limitations in precision (1.0000) and recall (0.9737). These characteristics indicate a propensity for a high number of false positives, which can result in incorrect positive predictions. Furthermore, the Log Likelihood of -0.8883489 indicates suboptimal calibration, making it less reliable for accurately estimating probabilities. These issues, including less than ideal recall, can impact the model’s performance when making predictions.

4.5.5.3 Build Random Forests model for feature combination 3

Build Random Forests model for feature combination 3

rfmodel3 <- randomForest(as.formula(paste("earning_class ~ ", paste(feature_combine3, collapse = ' + '))), data = youtube_train, ntree = 100)

Check the Model Performance

# train set AUC score 
rfmodel3_train_auc <- AUC_calculator(predict(rfmodel3, newdata=youtube_train), youtube_train[,target_column])
cat('rfmodel3\'s AUC score on the training set is',rfmodel3_train_auc, "\n")
## rfmodel3's AUC score on the training set is 1
# test set AUC score
rfmodel3_test_auc <- AUC_calculator(predict(rfmodel3, newdata=youtube_test), youtube_test[,target_column])
cat('rfmodel3\'s AUC score on the testing set is',rfmodel3_test_auc)
## rfmodel3's AUC score on the testing set is 0.9800151
# performance table
pretty_perf_table(rfmodel3,youtube_train[feature_combine3],youtube_train[,target_column] == positive_label,youtube_test[feature_combine3],youtube_test[,target_column] == positive_label)
## 
## 
## model                   precision   recall       f1   dev.norm
## --------------------- ----------- -------- -------- ----------
## rfmodel3 , training        1.0000   0.9803   0.9900     1.2011
## rfmodel3 , testing         0.9394   0.9118   0.9254     0.9884
# log likelihood
cat("rfmodel3 Log Likelihood:", calculateLogLikelihood(rfmodel3, youtube_test, positive_label, epsilon = 1e-6), "\n")
## rfmodel3 Log Likelihood: -0.8115532
cat("NULL model Log Likelihood:",logNull)
## NULL model Log Likelihood: -380.3058
# print ROC plot
pred_test_roc_rfmodel3 <- predict(rfmodel3, newdata=youtube_test)
pred_train_roc_rfmodel3 <- predict(rfmodel3, newdata=youtube_train)
plot_roc2(pred_test_roc_rfmodel1, youtube_test[[target_column]],
         pred_train_roc_rfmodel1, youtube_train[[target_column]])

Despite ‘Random Forest for feature combination’ model demonstrating an exceptionally high AUC score of 0.9999892, emphasizing its excellent discrimination ability, it faces challenges in terms of precision and recall. With a precision of 1.0000 and recall of 0.9671, the model tends to produce more false positives and may not fully capture all positive instances. Furthermore, the Log Likelihood value of -0.6429955 indicates calibration issues, affecting the model’s accuracy in probability estimates. These deficiencies could hinder the model’s predictive reliability, particularly in situations where precision and recall are of utmost importance.

4.5.6 Compare all models

4.5.6.1 models performace

model_data <- data.frame(
  model.type = c("Decision Tree", "Decision Tree", "Decision Tree","Logistic Regression", "Logistic Regression", "Logistic Regression","SVM", "Random Forest", "Random Forest","Random Forest"),
  model.name = c("tmodel1_train_auc", "tmodel2_train_auc", "tmodel3_train_auc", "lrmodel1_test_auc", "lrmodel2_test_auc","lrmodel3_test_auc", "svm_model_test_auc","rfmodel1_train_auc", "rfmodel2_train_auc", "rfmodel3_test_auc"),
  train.auc = c(tmodel1_train_auc,tmodel2_train_auc,tmodel3_train_auc,lrmodel1_train_auc,lrmodel2_train_auc,lrmodel3_train_auc,svm_model_train_auc,rfmodel1_train_auc,rfmodel2_train_auc,rfmodel3_train_auc)
)
print(model_data)
##             model.type         model.name train.auc
## 1        Decision Tree  tmodel1_train_auc 0.9926744
## 2        Decision Tree  tmodel2_train_auc 0.9397236
## 3        Decision Tree  tmodel3_train_auc 0.9719150
## 4  Logistic Regression  lrmodel1_test_auc 0.9781315
## 5  Logistic Regression  lrmodel2_test_auc 0.9497706
## 6  Logistic Regression  lrmodel3_test_auc 0.9617599
## 7                  SVM svm_model_test_auc 0.9331285
## 8        Random Forest rfmodel1_train_auc 1.0000000
## 9        Random Forest rfmodel2_train_auc 0.9999567
## 10       Random Forest  rfmodel3_test_auc 1.0000000

4.5.6.2 ROC plot for each model

# Generic function to generate ROC curve plots
generate_roc_plot <- function(models, model_names, data, target_column) {
  library(ROCit)
  library(ggplot2)
  
  # Create an empty list to store prediction performance objects for each model
  model_preds <- vector("list", length(models))
  
  # Iterate over the list of models to generate prediction performance objects
  for (i in 1:length(models)) {
    pred <- prediction(predict(models[[i]], newdata = data), data[[target_column]])
    model_preds[[i]] <- pred
  }
  
  # Create an empty data frame to store ROC curve data for all models
  roc_data <- data.frame()
  
  # Iterate over each model's name and prediction performance object
  for (i in 1:length(models)) {
    model_name <- model_names[i]
    pred <- model_preds[[i]]
    
    # Extract TPR and FPR data
    perf <- performance(pred, "tpr", "fpr")
    
    # Extract TPR and FPR values and add them to the data frame
    roc_data <- rbind(roc_data, data.frame(
      model = model_name,
      fpr = unlist(perf@x.values),
      tpr = unlist(perf@y.values)
    ))
  }
  
  # Use ggplot2 to plot ROC curves for multiple models
  roc_plot <- ggplot(roc_data, aes(x = fpr, y = tpr, color = model)) +
    geom_line() +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
    labs(x = "False Positive Rate (FPR)", y = "True Positive Rate (TPR)") +
    theme_minimal()
  
  return(roc_plot)
}

# Generate plot_all
models_all <- list(tmodel1, tmodel2, tmodel3, lrmodel1, lrmodel2, lrmodel3, svm_model, rfmodel1, rfmodel2, rfmodel3)
model_names_all <- c("tmodel1", "tmodel2", "tmodel3", "lrmodel1", "lrmodel2", "lrmodel3", "SVM_model", "rfmodel1", "rfmodel2", "rfmodel3")
roc_plot_all <- generate_roc_plot(models_all, model_names_all, youtube_train, target_column)

# Generate plot_tmodel
models_tmodel <- list(tmodel1, tmodel2, tmodel3)
model_names_tmodel <- c("tmodel1", "tmodel2", "tmodel3")
roc_plot_tmodel <- generate_roc_plot(models_tmodel, model_names_tmodel, youtube_train, target_column)

# Generate plot_lrmodel
models_lrmodel <- list(lrmodel1, lrmodel2, lrmodel3)
model_names_lrmodel <- c("lrmodel1", "lrmodel2", "lrmodel3")
roc_plot_lrmodel <- generate_roc_plot(models_lrmodel, model_names_lrmodel, youtube_train, target_column)

# Generate plot_rfmodel
models_rfmodel <- list(rfmodel1, rfmodel2, rfmodel3)
model_names_rfmodel <- c("rfmodel1", "rfmodel2", "rfmodel3")
roc_plot_rfmodel <- generate_roc_plot(models_rfmodel, model_names_rfmodel, youtube_train, target_column)

print(roc_plot_all)

At the same time, comparing the performance of these 10 different models and ROC images can make a preliminary judgment. First of all, for the random forest model, the AUC value of almost all random forest models is close to 1, which is almost a straight line in the ROC curve. This is an obvious trend of overfitting, so this model type is excluded. The SVM model performs well, but it has a limitation, that is, it can only analyze numerical variables, so the range that this model can predict is very limited. Then there is the logistic regression model. Although the AUC values are greater than 0.9, because the log likelihood cannot be obtained, the relationship with the null model cannot be compared and can only be excluded. Finally, in the decision tree model, the AUC value of the decision tree model for all variables is 0.99, which is very likely to be overfitting. If we change the data set, we may not be able to predict correctly. The AUC value of the decision tree model for feature combination3 is 0.9719, which is greater than the decision tree model using feature combination2, which is 0.939. The accuracy of tmodel3 is 0.9281, which is greater than the accuracy of tmodel2, which is 0.8696. The recall rate and f1 of tmodel3 are 0.9342 and 0.9311 respectively, which are also greater than the recall of tmodel2. The rate is 0.7895 and f1 is 0.8276, and the log likelihood value of tmodel3 is -0.02955777, which is much larger than the -2.079434 of tmodel2. In summary, the decision tree model using feature combination3 is the most effective model.

Part5 Clustering

Considering the diversity of readers, we will answer a few common questions at the beginning of Clustering:

Do you need a target variable for clustering?

No, clustering does not require a target variable. Clustering is an unsupervised learning method, which means it seeks to identify patterns or groups in the data without the use of previously labeled examples. It relies on the inherent structure and distribution of the data.

Which clustering technique will we use?

The choice of clustering technique often depends on the dataset and the specific problem at hand. Common methods include:

K-Means Clustering: Best suited for spherical clusters when the number of clusters (K) is known or can be estimated.

Hierarchical Clustering: Appropriate when a tree-like structure is preferred or when one wishes to understand the hierarchical relationships within the data.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Suitable for clusters of varying shapes and for data with noise.

For this particular task, we will employ both Hierarchical Clustering and K-Means Clustering. The reason for this is because Hierarchical Clustering offers a comprehensive visualization that aids in understanding the potential cluster groupings at various levels of granularity. On the other hand, K-Means Clustering is computationally efficient and works well when the number of clusters can be reasonably estimated.

How to set parameters? How do you determine an appropriate K value for clustering?

Parameter setting often combines domain knowledge, empirical testing, and experimentation. For K-Means, a typical technique to ascertain the K value is the Elbow Method. By plotting the sum of squared distances for various K values, the “elbow” of the curve represents an optimal value for K, striking a balance between precision and computational cost.

How do you evaluate clustering performance?

Evaluating clustering performance can be a challenge, especially given its unsupervised nature. Internal metrics such as the Silhouette Coefficient, Davies-Bouldin Index, and Calinski-Harabasz Index are used to gauge the quality of clusters without any external benchmark. If ground truth labels are available, even if not used in the clustering process, external metrics like the Adjusted Rand Index or Normalized Mutual Information can be employed.

What insights can be drawn from the clustering results?

Clustering can yield diverse insights depending on the context. In the realm of market segmentation, it can spotlight varied customer behavior patterns. In biology, it may identify previously unknown species or genes with analogous functions. It is crucial to interpret the clusters in the context of the domain and the specific problem being addressed.

What distinguishes clustering from classification?

While both clustering and classification are methods used for grouping data points, they serve divergent purposes. Clustering, an unsupervised learning approach, groups data points based on inherent similarities without predefined labels. In contrast, classification, a supervised learning method, predicts the predefined label of a data point leveraging its features. Classification has a clear target variable, whereas clustering seeks to discern structure within the data.

5.1 Data preprocessing

Data Import

In this section, we are working with a pre-processed and cleaned dataset to prepare it for clustering.

clustering_data <- clustering_data[,c("country","category","subscribers","video_views",
"uploads","video_views_last_30_days","monthly_earnings_low","monthly_earning_high","yearly_earning_low","yearly_earning_high","subs_last_30_days","tertiary_edu_enrollment","population","unemployment_rate" ,"urban_population")]

The frequency of occurrence of each country and category in the dataset was first calculated. Based on these frequencies, countries and categories with fewer occurrences were identified, and then these less frequent countries and “unknown” categories were replaced with the generic country “world” label and category “other” label to effectively aggregate the data. We do this first to reduce noise, and in large datasets, certain categories may be underrepresented.

During the clustering process, these sparse data points may be seen as noise or outliers. Aggregating them under common labels ensures that they do not adversely affect the overall clustering results, and secondly, this can enhance interpretability, the presence of secondary categories can complicate the interpretation of clustering results. Through aggregation, the data structure can be simplified to make the clustering results clearer and easier to interpret, as we need to consider different readers’ reading needs, and most of the audiences are not able to read complex clustering results quickly. Furthermore this can improve efficiency, dealing with too many separate categories may increase the computational complexity of the clustering algorithm. Reducing the number of categories can speed up the algorithm.

After we perform the operation, we again convert the data type to the Factor type to prevent the previous substitution will be non-existent Factor.

# Calculate the frequency of each country
country_counts <- table(clustering_data$country)
categories_counts <- table(clustering_data$category)

# Replace countries with fewer than 5 occurrences with "World", and categories with fewer than 5 occurrences with "Others"
countries_to_replace <- names(country_counts[country_counts < 5])
categories_to_replace <- names(categories_counts[categories_counts < 5])

# Convert factor columns to character type
clustering_data$country <- as.character(clustering_data$country)
clustering_data$category <- as.character(clustering_data$category)

# Replace the names of the countries in the list with "World"
clustering_data$country[clustering_data$country %in% countries_to_replace] <- "World"
clustering_data$country[clustering_data$country == "Unknown"] <- "World"

# Replace the names of the categories in the list with "Others"
clustering_data$category[clustering_data$category %in% categories_to_replace] <- "Others"

# Convert back to factor type
clustering_data$country <- as.factor(clustering_data$country)
clustering_data$category <- as.factor(clustering_data$category)

5.2 Data Scaling

5.3 Finding Best K

Due to the breadth of the audience present in the report, we will explain the role of each function all over again for better comprehensibility.

First, define helper functions to compute distances and evaluate cluster quality.

Then, define a function to calculate the Calinski-Harabasz Index using either k-means or hierarchical clustering. This function will be instrumental in determining the best number of clusters for the dataset.

Finally, compute the CH Index for the dataset, providing insights into the optimal number of clusters.

The sqr_euDist function computes the squared Euclidean distance between two points. This is a common distance measure used in clustering algorithms.The Calinski-Harabasz Index helps make this determination by measuring the ratio of the sum of between-cluster dispersion to within-cluster dispersion.

The functions wss, wss_total, and tss compute the Within-Cluster Sum of Squares, total WSS, and Total Sum of Squares, respectively. These metrics help evaluate the compactness of the clusters.Functions like WSS and TSS provide metrics that can help assess the quality of the clusters. They ensure the clusters are tight and well-separated.

The CH_index function computes the Calinski-Harabasz Index, a metric used to determine the optimal number of clusters. This function is designed to work with either hierarchical clustering or k-means clustering.By allowing the choice between kmeans and hclust in the CH_index function, the code provides flexibility in the clustering approach. Different datasets or use cases might favor one method over the other

# Define a function to compute the squared Euclidean distance between two points x and y
sqr_euDist <- function(x, y) {
  sum((x - y)^2)
}

# Define a function to compute the Within-Cluster Sum of Squares (WSS) for a given cluster matrix
wss <- function(clustermat) {
  c0 <- colMeans(clustermat)
  sum(apply(clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)}))
}

# Define a function to compute the total WSS for all clusters
wss_total <- function(scaled_df, labels) {
  wss_sum <- 0
  k <- length(unique(labels))
  for (i in 1:k)
    wss_sum <- wss_sum + wss(subset(scaled_df, labels == i))
  wss_sum
}

# Define a function to compute the Total Sum of Squares (TSS)
tss <- function(scaled_df) {
  wss(scaled_df)
}

# Define a function to compute the Calinski-Harabasz Index (CH Index) using either hierarchical clustering or k-means clustering
# The CH Index is useful for determining the optimal number of clusters
CH_index <- function(scaled_df, kmax, method="kmeans") {
  if (!(method %in% c("kmeans", "hclust")))
    stop("method must be one of c('kmeans', 'hclust')")
  npts <- nrow(scaled_df)
  wss_value <- numeric(kmax)
  wss_value[1] <- wss(scaled_df)
  
  if (method == "kmeans") {
    for (k in 2:kmax) {
      clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
      wss_value[k] <- clustering$tot.withinss
    }
  } else {
    d <- dist(scaled_df, method="euclidean")
    pfit <- hclust(d, method="ward.D2")
    for (k in 2:kmax) {
      labels <- cutree(pfit, k=k)
      wss_value[k] <- wss_total(scaled_df, labels)
    }
  }
  
  bss_value <- tss(scaled_df) - wss_value
  B <- bss_value / (0:(kmax-1))
  W <- wss_value / (npts - 1:kmax)
  data.frame(k = 1:kmax, CH_index = B/W, WSS = wss_value)
}

5.4 hierarchical clustering

The code first sets up the configurations for hierarchical clustering.

It then loops through each combination, performing clustering and visualizing the results. This iterative approach ensures that we evaluate multiple configurations to find the best one.

After examining the various dendrograms, we can finalize the clustering by extracting the cluster memberships. This final step provides actionable labels for each data point, allowing for further insights and analysis.

# Define the distance methods and linkage methods for hierarchical clustering
distance_methods <- c("manhattan", "euclidean")
linkage_methods <- c("average", "single", "ward.D2")

# Initialize a list to store the results of each clustering configuration
all_cluster_results <- list()

# Loop through each combination of distance and linkage methods
for (dist_method in distance_methods) {
  for (linkage_method in linkage_methods) {
    # Compute the distance matrix for the given distance method
    dist_matrix <- dist(clustering_numeric_data, method = dist_method)
    # Perform hierarchical clustering using the computed distance matrix and the given linkage method
    cluster_result <- hclust(dist_matrix, method = linkage_method)
    # Store the clustering result in the results list with a unique key based on the distance and linkage methods
    all_cluster_results[[paste(dist_method, linkage_method, sep="_")]] <- cluster_result
    # Plot the dendrogram for the clustering result
    plot(cluster_result, labels = FALSE, main = paste("Distance method:", dist_method, ", Linkage method:", linkage_method))
    rect.hclust(cluster_result, k=3)
  }
}

# Extract cluster memberships for the last clustering result with k=3
groups <- cutree(cluster_result, k=3)

The dendrogram is a visual representation of how the clusters are merged iteratively. The vertical lines on the dendrogram indicate the dissimilarity (or distance) at which two clusters are merged. A longer vertical line suggests a greater distance or dissimilarity.

Deciding on the Number of Clusters:

Elbow Method: An effective method to determine the ideal number of clusters is by looking for an “elbow” in the dendrogram. This is the point where the merging distance starts to increase significantly, indicating that further merges might not be meaningful. In many visually appealing plots, it’s often observed that cutting the dendrogram to produce three clusters is ideal, as the merging distance increases rapidly beyond this point. Selection of Distance and Linkage Methods:

Based on the output from rect.hclust, the combination of euclidean distance and single linkage seems to be the most effective. Thus, for our subsequent analysis, we’ll choose the euclidean distance with single linkage method.

Visual Clustering

Data Preparation

First, we use Principal Component Analysis (PCA) to reduce the high-dimensional data into a simpler form, preserving as much of the original variance as possible.

Then projects the data into a 2D space using the first two principal components. This projection facilitates visualization by translating the multi-dimensional data into a format that can be easily plotted on a two-dimensional plane.

The 2D projections are enriched by appending cluster assignments and country information to each data point. This makes the subsequent visualizations more informative, showing not only the cluster each point belongs to but also the country it’s associated with

We also visualize Cluster boundary. It provide a clear visual representation of each cluster’s spread in the 2D space, the code computes the convex hull for each cluster. A convex hull is the smallest convex polygon that encompasses all points of a cluster, effectively showing the boundary of the cluster.

# Perform principal component analysis (PCA) on the numeric data
princ <- prcomp(clustering_numeric_data)

# Number of components to retain for projection (in this case 2D)
nComp <- 2

# Project the data into the first two principal components
project2D <- as.data.frame(predict(princ, newdata=clustering_numeric_data)[,1:nComp])

# Combine the 2D projections with cluster groups and country information
hclust_project2D <- cbind(project2D, cluster=as.factor(groups), country=clustering_data$country)

# Define a function to compute the convex hull for each cluster in the 2D projection
find_convex_hull <- function(proj2Ddf, groups) {
  do.call(rbind,
          lapply(unique(groups),
                 FUN = function(c) {
                   f <- subset(proj2Ddf, cluster==c)
                   f[chull(f),]
                 }
          )
  )
}

# Compute the convex hull for each cluster
hclust_hull <- find_convex_hull(hclust_project2D, groups)

Visualize CH index plot and WSS plot.

# Compute the CH Index for the clustering numeric data using hierarchical clustering
crit.df <- CH_index(clustering_numeric_data, 10, method="hclust")

# Handle potential missing values
crit.df <- na.omit(crit.df)

# Plotting
# Plot the CH index against k values
CHIndexPlot <- ggplot(crit.df, aes(x=k, y=CH_index)) +
  geom_point() + geom_line(colour="red") +
  scale_x_continuous(breaks=1:10, labels=1:10) +
  labs(y="CH index") + theme(text=element_text(size=20))

# Plot the WSS against k values
WSSPlot <- ggplot(crit.df, aes(x=k, y=WSS)) +
  geom_point() + geom_line(colour="blue") +
  scale_x_continuous(breaks=1:10, labels=1:10) +
  labs(y="WSS") + theme(text=element_text(size=20))

# Display both plots side by side
grid.arrange(CHIndexPlot, WSSPlot, nrow=1)

This graph shows how the CH index (Calinski-Harabasz Index) and WSS (sum of squares within clusters) vary with the value of k (i.e., the number of clusters), respectively. Given the diversity of readers, we will analyze the graphs in order to increase readability

CH Index:

The CH index (Calinski-Harabasz Index) is a metric used to evaluate the effectiveness of clustering.

The CH index is the ratio of inter-cluster distance to intra-cluster distance. A high CH index indicates that points within clusters are very close to each other and relatively far from other clusters, which is what we want to see for good clustering.

We can see that the CH index reaches its maximum value at k=2, which means that k=2 provides the best ratio of inter-cluster distance to intra-cluster distance. That is, when there are only two clusters, intra-cluster similarity and inter-cluster dissimilarity are best.

WSS (intra-cluster sum of squares):

WSS measures the total distance from all points within a cluster to its cluster center. A smaller WSS means that points within a cluster are closer to its center, thus forming tighter clusters.

At increasing values of k, we expect the WSS to decrease because more clusters mean fewer points within each cluster, and thus less intra-cluster distance.

However, we are not seeking to minimize the WSS, but rather to find a suitable value of k at which increasing the number of clusters no longer significantly reduces the WSS. This “turning point” or “elbow” is often used to select the right value of k. In the graph we provided, the WSS is reduced by a factor of k.

The drop in WSS from k=2 to k=3 is the most significant, suggesting that k=2 to k=3 brings the greatest improvement in clustering. After that, the decrease in WSS slows down, which means that adding more clusters does not significantly improve the clustering. For WSS, instead of aiming for the highest value, we look for the “elbow”, where the WSS decline starts to slow down. k=3 neighborhood may be a suitable choice.

# Use ggplot visualize Clustering
ggplot(hclust_project2D, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=cluster, color=cluster)) +
  geom_text(aes(label=country, color=cluster), hjust=0, vjust=1, size=3) +
  geom_polygon(data=hclust_hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
  theme(text=element_text(size=20))

In the chart, we have a K of 3, with 3 squares, and there are overlapping parts between each square, with fewer parts overlapping in all three squares. We believe this is due to the fact that the data was chosen from the top 1,000 tanker bloggers, out of the millions of tanker bloggers, these 1,000 represent the head of all tanker bloggers, which inevitably leads to duplication of area, followed by the fact that many countries are duplicated multiple times, and although the other data varies from blogger to blogger, each feature is not independent, and they are linked to the country’s geographic location, which leads to overlapping area as well

5.6 K-means Clustering

The clusterboot method was first used to verify the stability of hierarchical clustering. This is to determine if we have chosen the right number of clusters. The stability value indicates how stable each cluster is when resampling the data.

Next, the code performs clustering using the k-means algorithm, where the optimal value of k is set to 5. The kmeans method is used to find the centers of the clusters and assign data points to the nearest centers.

Finally, the code uses the fpc library to compute the CH index and ASW values, which help evaluate the quality of clustering for different values of k. A higher CH index indicates greater inter-cluster variation and less intra-cluster variation, while ASW measures the cohesiveness of the clusters, with values closer to 1 indicating better clustering.

# Identify and display the two most stable clusters
cat("Thus, clusters", order(-stability_values)[1], "and", order(-stability_values)[2], "are highly stable")
## Thus, clusters 2 and 3 are highly stable
# Perform clustering using the k-means algorithm with the optimal value for k set to 5
optimalKValue <- 5
kMeansResult <- kmeans(clustering_numeric_data, optimalKValue, nstart=100, iter.max=100)

# Get the cluster results from k-means
clusterGroups <- kMeansResult$cluster

# Calculate CH index and ASW values for k values ranging from 1 to 10 using the fpc library
kMeansCH <- kmeansruns(clustering_numeric_data, krange=1:10, criterion="ch")
kMeansASW <- kmeansruns(clustering_numeric_data, krange=1:10, criterion="asw")

# Combine k-means CH index and ASW values into a single data frame
kMeansMetrics <- data.frame(k=1:10, CH_index=kMeansCH$crit, ASW=kMeansASW$crit)

First, we use the ggplot2 library to plot the CH index and ASW values against the k-value (number of clusters).The CH index is used to assess the differences between different clusters, while the ASW values assess the internal cohesion of each cluster. Both metrics are used to help select the best k-value.

Then, for k values from 2 to 5, we used the k-means algorithm for clustering and projected these results onto a 2D graph. Also, for each value of k, a convex hull is computed to represent the boundary of each cluster.

Finally, we display the clustering results of these four k-values side-by-side in a 2x2 grid using the grid.range function.

# Plot the CH index against k values
CHIndexPlot <- ggplot(kMeansMetrics, aes(x=k, y=CH_index)) +
  geom_point() + geom_line(colour="red") +
  scale_x_continuous(breaks=1:10, labels=1:10) +
  labs(y="CH index") + theme(text=element_text(size=20))

# Plot the ASW against k values
ASWPlot <- ggplot(kMeansMetrics, aes(x=k, y=ASW)) +
  geom_point() + geom_line(colour="blue") +
  scale_x_continuous(breaks=1:10, labels=1:10) +
  labs(y="ASW") + theme(text=element_text(size=20))

# Display both plots side by side
grid.arrange(CHIndexPlot, ASWPlot, nrow=1)

CH Index Chart:

In this graph, the CH index reaches a peak when k=2, which indicates that with k=2 we get relatively better clustering because the data points within the clusters are relatively centralized and the distances to other clusters are larger.

As the value of k increases, the CH index gradually decreases, which may mean that the increased clusters do not make the data better clustered.

ASW (Average contour coefficient) Graph.

Average contour coefficient is a measure of the difference between the average distance of a data point to other data points in its cluster and the average distance of the data point to the nearest other cluster. Values closer to 1 indicate better clustering, while values closer to -1 indicate worse clustering.

From the graph, ASW reaches a high value at k=2 and then fluctuates. This means that clustering is relatively better at k=2.

Between k=5 and k=6, there is a significant drop in ASW, which may imply that clustering is not as effective between these two k values.

Based on these two graphs, we can tentatively conclude that clustering is relatively better at k=2.

# For k values ranging from 2 to 5, plot the results of the k-means clustering
plotList <- list()
kRange <- 2:5
for (currentK in kRange) {
  currentClusters <- kmeans(clustering_numeric_data, currentK, nstart=100, iter.max=100)$cluster
  kMeansProjection2D <- cbind(project2D, cluster=as.factor(currentClusters), country=clustering_data$country)
  kMeansHull <- find_convex_hull(kMeansProjection2D, currentClusters)

  currentPlot <- ggplot(kMeansProjection2D, aes(x=PC1, y=PC2)) +
    geom_point(aes(shape=cluster, color=cluster)) +
    geom_polygon(data=kMeansHull, aes(group=cluster, fill=cluster), alpha=0.4, linetype=0) +
    labs(title = sprintf("k = %d", currentK)) +
    theme(legend.position="none", text=element_text(size=20))
  
  plotList[[currentK - 1]] <- currentPlot
}

# Display k-means clustering results in a 2x2 grid
grid.arrange(plotList[[1]], plotList[[2]], plotList[[3]], plotList[[4]], nrow=2)

k = 2:

The data are divided into two distinct clusters, one of which is predominantly in the negative region of PC1, and the other is concentrated in the region close to 0 of PC1. there is a clear boundary between the two clusters.

k = 3:

The data is divided into three clusters. Among them, the red and blue clusters are similar to the division at k = 2.

The added green cluster is mainly concentrated in the positive region of PC2, and there is also a clear boundary between it and the blue cluster.

k = 4:

The red cluster is further subdivided into two clusters (red and green).

The boundaries between the green and purple clusters do not appear to be particularly clear and there may be some overlap.

The blue cluster is relatively centralized.

k = 5:

The data is divided into five clusters, but here the boundaries between clusters start to become less distinct compared to the k = 4 case.

In the negative region of PC1, there are multiple clusters in close proximity to each other with possible overlap.

The green and yellow clusters do not seem to have very clear boundaries with the other clusters.

From the above analysis, we see that:

The clustering seems to work best when k=2 and k=3 because each cluster has clear boundaries and there is not much overlap.

When k increases to 4 or 5, the boundaries between clusters start to blur, especially at k=5, although we can identify more subclusters.

Combining the previous CH index and ASW analyses, and considering the clarity of clustering and the boundaries between clusters, we believe that k=2 or k=3 can be used as the optimal number of clusters.

Reference

[1] Elgiriyewithana, N. (2023). Global YouTube Statistics 2023. Kaggle. https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023

[2] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

[3] Wickham, H., François, R., Henry, L., & Müller, K. (2021). dplyr: A Grammar of Data Manipulation (R package version 1.0.7). https://dplyr.tidyverse.org

[4] Friendly, M. (2002). Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4), 316–324.

[5] McNamara, A., Zhu, H., Arino de la Rubia, E., Ellis, S., Lowndes, J., & Quinn, M. (2021). skimr: Compact and Flexible Summaries of Data (R package version 2.1.3). https://cran.r-project.org/web/packages/skimr/

[6] Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press.

[7] Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47(1), 183-189.

[8] Kuhn, M. (2019). The caret Package 4.1 Simple Splitting Based on the Outcome. http://topepo.github.io/caret/data-splitting.html

[9] Feng, C., Wang, H., Lu, N., Chen, T., He, H., Lu, Y., & Tu, X. M. (2014). Log-transformation and its implications for data analysis. Shanghai Arch Psychiatry, 26(2), 105-109.

[10] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

[11] OpenAI. (2023). ChatGPT (August 3 Version) [Large language model]. https://chat.openai.com